[00:15:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:25:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:26:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:27:16] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS13030/IPv6: Idle - Init7, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:28:34] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:28:34] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:29:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:16] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:30:54] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:31:04] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh! 
With the exception of a couple of minor issues, I think this is good to go (as before, feel free to merge once those are fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [00:31:06] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:32:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:32:25] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:34:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:34:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [00:34:16] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:34:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1091.eqiad.wmnet with OS bullseye [00:34:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374269 (10Jclark-ctr) [00:38:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1099834 [00:38:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1099834 (owner: 10TrainBranchBot) [00:39:53] 
10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374271 (10VRiley-WMF) [00:43:34] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:57:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1099834 (owner: 10TrainBranchBot) [01:07:38] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1099836 [01:08:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1099836 (owner: 10TrainBranchBot) [01:22:27] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:22:54] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:26:25] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:26:35] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:26:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1091.eqiad.wmnet with OS bullseye [01:26:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host 
ms-be1091.eqiad.wmnet with OS bullseye executed... [01:29:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10374354 (10phaultfinder) [01:32:52] (03PS1) 10Wziko: feat(cfssl-issuer): change default value for external_services in cfssl issuer helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837 [01:34:51] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1099836 (owner: 10TrainBranchBot) [01:35:40] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:35:53] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:36:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [01:36:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1091.eqiad.wmnet with OS bullseye [01:36:55] (03CR) 10Wziko: "Suggestion to handle non existing calico crds" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837 (owner: 10Wziko) [01:38:48] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:38:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:47:48] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:53:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1083.mgmt.eqiad.wmnet with chassis 
set policy FORCE_RESTART [01:58:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374383 (10VRiley-WMF) [02:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.6 [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1099839 (https://phabricator.wikimedia.org/T375665) [02:08:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.6 [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1099839 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [02:16:49] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [02:17:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:20:25] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1084 - vriley@cumin1002" [02:20:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1084 - vriley@cumin1002" [02:20:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10374416 (10phaultfinder) [02:25:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.6 [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1099839 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [02:29:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:29] FIRING: [4x] ProbeDown: Service wdqs2026:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:32:51] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:33:04] PROBLEM - RPKI Validator RTR port on rpki2003 is CRITICAL: connect to address 10.192.24.3 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [02:33:08] PROBLEM - Routinator process on rpki2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [02:33:50] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2026:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [02:34:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:25] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [02:36:42] 
FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:46] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [02:39:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:56:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1091.eqiad.wmnet with OS bullseye [02:56:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10374430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1091.eqiad.wmnet with OS bullseye executed... 
[02:59:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0300) [03:01:42] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:24] RECOVERY - dump of x1 in codfw on backupmon1001 is OK: Last dump for x1 at codfw (db2197) taken on 2024-12-03 00:52:03 (59 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:24:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10374433 (10phaultfinder) [03:29:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:30:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:34:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:54:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0400) [04:01:40] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099841 (https://phabricator.wikimedia.org/T375665) [04:01:41] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099841 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [04:02:24] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099841 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [04:02:53] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.6 refs T375665 [04:02:56] T375665: 1.44.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T375665 [04:04:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:10:25] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:10:55] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:37:04] RECOVERY - RPKI Validator RTR port on rpki2003 is OK: TCP OK - 0.031 second response time on 10.192.24.3 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [04:40:04] PROBLEM - RPKI Validator RTR port on rpki2003 is CRITICAL: connect to address 10.192.24.3 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [04:51:18] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.6 refs T375665 (duration: 48m 24s) [04:51:20] T375665: 1.44.0-wmf.6 deployment blockers - 
https://phabricator.wikimedia.org/T375665 [04:54:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0500) [05:01:29] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.3 (duration: 01m 27s) [05:07:04] RECOVERY - RPKI Validator RTR port on rpki2003 is OK: TCP OK - 0.031 second response time on 10.192.24.3 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [05:10:04] PROBLEM - RPKI Validator RTR port on rpki2003 is CRITICAL: connect to address 10.192.24.3 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [05:15:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:24:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:44:25] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 12:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 non-prod hosts [05:44:28] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150 [05:44:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 non-prod hosts [05:47:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P71479 and previous config saved to /var/cache/conftool/dbconfig/20241203-054718-root.json [05:55:50] (03PS1) 10Marostegui: es2041: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1099851 (https://phabricator.wikimedia.org/T381259) [05:57:33] (03CR) 10Marostegui: [C:03+2] es2041: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1099851 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [06:00:21] (03PS1) 10Marostegui: instances.yaml: Add es2041 [puppet] - 10https://gerrit.wikimedia.org/r/1099865 (https://phabricator.wikimedia.org/T381259) [06:00:38] !log [Netbox] T379334 Added VIPs via UI for wdqs-internal-[main,scholarly].svc.[eqiad,codfw].wmnet [06:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:40] T379334: Create DNS records for wdqs-internal-main and wdqs-internal-scholarly - 
https://phabricator.wikimedia.org/T379334 [06:00:48] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [06:01:11] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es2041 [puppet] - 10https://gerrit.wikimedia.org/r/1099865 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [06:01:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:02:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P71480 and previous config saved to /var/cache/conftool/dbconfig/20241203-060224-root.json [06:05:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:06:01] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [06:06:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2041 depooled T381259', diff saved to https://phabricator.wikimedia.org/P71481 and previous config saved to /var/cache/conftool/dbconfig/20241203-060614-marostegui.json [06:06:17] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [06:06:36] !log [Netbox] T379334 Aborted netbox sync cookbook due to wrong IPs for wdqs-internal-scholarly. 
Fixed in UI, re-running cookbook now [06:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:39] T379334: Create DNS records for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379334 [06:06:40] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [06:08:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2041 to es4 with just minimal weight T381259', diff saved to https://phabricator.wikimedia.org/P71482 and previous config saved to /var/cache/conftool/dbconfig/20241203-060847-marostegui.json [06:10:19] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add VIPs for wdqs-internal-main and wdqs-internal-scholarly - ryankemper@cumin2002" [06:10:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add VIPs for wdqs-internal-main and wdqs-internal-scholarly - ryankemper@cumin2002" [06:10:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:15:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:17:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P71483 and previous config saved to /var/cache/conftool/dbconfig/20241203-061729-root.json [06:19:46] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss 
= 0%, RTA = 96.11 ms [06:26:09] (03PS1) 10Marostegui: mariadb: Productionize es2042 [puppet] - 10https://gerrit.wikimedia.org/r/1100008 (https://phabricator.wikimedia.org/T381259) [06:31:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2021.codfw.wmnet with reason: cloning [06:32:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2021.codfw.wmnet with reason: cloning [06:32:01] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2042 [puppet] - 10https://gerrit.wikimedia.org/r/1100008 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [06:32:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P71484 and previous config saved to /var/cache/conftool/dbconfig/20241203-063234-root.json [06:34:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2022 to es4 master T381259', diff saved to https://phabricator.wikimedia.org/P71485 and previous config saved to /var/cache/conftool/dbconfig/20241203-063408-marostegui.json [06:34:12] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [06:37:19] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [06:39:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:31] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change VIPs for wdqs-internal-main and wdqs-internal-scholarly to avoid mw-parsoid collision - ryankemper@cumin2002" [06:41:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Change VIPs for wdqs-internal-main and wdqs-internal-scholarly to avoid mw-parsoid collision - ryankemper@cumin2002" [06:41:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:45:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:46:11] (03PS1) 10Ryan Kemper: wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) [06:50:33] (03PS2) 10Ryan Kemper: wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) [06:50:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:51:40] (03PS3) 10Ryan Kemper: wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) [06:56:10] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0700) [07:00:04] marostegui, Amir1, and arnaudb: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0700). 
[07:01:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:02:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:10:25] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for remaining Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1099190 (owner: 10Muehlenhoff) [07:18:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:21:05] (03CR) 10Arnaudb: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [07:21:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [07:24:18] (03CR) 10Jelto: 
[C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1099657 (https://phabricator.wikimedia.org/T374717) (owner: 10Hashar) [07:25:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10374608 (10ops-monitoring-bot) Draining ganeti1009.eqiad.wmnet of running VMs [07:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [07:28:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [07:28:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10374613 (10ops-monitoring-bot) Draining ganeti1009.eqiad.wmnet of running VMs [07:42:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:48:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:50:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [07:51:13] !incidents [07:51:14] 5498 (ACKED) kafka-main1003/Kafka Broker Server (paged) [07:51:14] 5503 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [07:51:14] 5504 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [07:51:14] 5502 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [07:51:14] 5501 (RESOLVED) Primary outbound 
port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [07:51:14] 5500 (RESOLVED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs) [07:51:28] !ack 5503 [07:51:28] 5503 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [07:51:30] !ack 5504 [07:51:31] 5504 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [07:53:18] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks a lot! Applies cleanly to git master apart from whitespace warnings so I'm merging this." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089939 (owner: 10Pppery) [07:55:46] (03CR) 10Muehlenhoff: [C:03+2] an-web: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1099195 (owner: 10Muehlenhoff) [07:57:18] !log Switchover es4 codfw master to es2022 dbmaint (this happened an hour ago) T381259 [07:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:24] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [07:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2021', diff saved to https://phabricator.wikimedia.org/P71486 and previous config saved to /var/cache/conftool/dbconfig/20241203-075751-marostegui.json [07:58:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2021.codfw.wmnet with reason: cloning [07:58:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2021.codfw.wmnet with reason: cloning [08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:01:37] (03CR) 10Stevemunene: [C:03+1] airflow-wmde: point to the cloudnative-pg cluster in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099197 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [08:02:02] (03CR) 10Stevemunene: [C:03+1] postgresql-airflow-wmde: add helmfiles and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099196 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [08:06:59] (03CR) 10Muehlenhoff: [C:03+2] grafana: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1099192 (owner: 10Muehlenhoff) [08:07:01] (03CR) 10DCausse: rdf-streaming-updater: add wdqs udpater streams in event stream config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [08:07:26] (03PS4) 10BCornwall: haproxy: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) [08:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213', diff saved to https://phabricator.wikimedia.org/P71487 and previous config saved to /var/cache/conftool/dbconfig/20241203-080726-marostegui.json [08:08:25] (03CR) 10DCausse: [C:03+2] flink-app: add a component label to the flink-app configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098885 (owner: 10DCausse) [08:08:44] Hey @Amir1 or @urbanecm, are you (or another deployer) around by chance? 
It would be nice to get [Growth: enable temporary Surfacing Alpha on pilot wikis](https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099690) deployed [08:08:55] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4621/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [08:08:55] We have the green light now :) [08:09:03] But it can also wait till the next window [08:09:23] Hey MichaelG_WMF, okay, let's do that then. [08:09:34] (03PS1) 10Marostegui: mariadb: Move db1213 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1100040 (https://phabricator.wikimedia.org/T375593) [08:09:36] (03CR) 10DCausse: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [08:09:39] 🙌 [08:09:45] (03Merged) 10jenkins-bot: flink-app: add a component label to the flink-app configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098885 (owner: 10DCausse) [08:10:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1213.eqiad.wmnet with reason: Moving to m3 [08:10:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1213.eqiad.wmnet with reason: Moving to m3 [08:11:02] (03PS3) 10Michael Große: Growth: enable temporary Surfacing Alpha on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) [08:11:04] (03CR) 10Urbanecm: [C:03+2] Growth: enable temporary Surfacing Alpha on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) (owner: 10Michael Große) [08:11:37] (03PS2) 10Marostegui: mariadb: Move db1213 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1100040 (https://phabricator.wikimedia.org/T375593) 
[08:11:37] (03PS1) 10Marostegui: instances.yaml: Remove db1213 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1100041 (https://phabricator.wikimedia.org/T375593) [08:12:05] (03Merged) 10jenkins-bot: Growth: enable temporary Surfacing Alpha on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) (owner: 10Michael Große) [08:12:27] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1213 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1100040 (https://phabricator.wikimedia.org/T375593) (owner: 10Marostegui) [08:12:36] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db1213 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1100041 (https://phabricator.wikimedia.org/T375593) (owner: 10Marostegui) [08:13:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1217.eqiad.wmnet with reason: Moving to m3 [08:13:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1217.eqiad.wmnet with reason: Moving to m3 [08:13:39] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099690|Growth: enable temporary Surfacing Alpha on pilot wikis (T379976)]] [08:13:41] T379976: Surfacing "Add a link" Structured Tasks: Alpha Release Plan and Release Task (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T379976 [08:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1213 from dbctl T375593', diff saved to https://phabricator.wikimedia.org/P71489 and previous config saved to /var/cache/conftool/dbconfig/20241203-081434-marostegui.json [08:14:37] T375593: [misc] db1159 data corruption - https://phabricator.wikimedia.org/T375593 [08:16:36] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:17:10] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: 
https://wikitech.wikimedia.org/wiki/HAProxy [08:17:26] ^ expected [08:21:32] (03CR) 10Vgutierrez: [C:03+1] icinga: Remove RSA cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [08:21:36] !log urbanecm@deploy2002 urbanecm, migr: Backport for [[gerrit:1099690|Growth: enable temporary Surfacing Alpha on pilot wikis (T379976)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:21:39] T379976: Surfacing "Add a link" Structured Tasks: Alpha Release Plan and Release Task (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T379976 [08:21:49] * MichaelG_WMF looks [08:21:57] Ty :) [08:22:06] (03CR) 10Vgutierrez: [C:03+1] "nice job @bcornwall@wikimedia.org!" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [08:23:40] vgutierrez: in case you're around already: We got paged for 'Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)' - but we could not find an explicit reason as of now [08:23:55] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10374687 (10SuzanneWood-WMDE) suzanne.wood@wikimedia.de [08:24:25] jayme: what's the timestamp? [08:24:50] ok.. I'm already seeing it on grafana [08:25:00] 07:48 [08:25:09] @urbanecm it works, though I notice a problem with the thumbnails. But that is just a minor bug and should not block this [08:25:31] seems like swift traffic is already going down again https://grafana.wikimedia.org/d/pXnJdJ17k/all-clusters-network-traffic-traffic?from=now-3h&orgId=1&to=now&var-cluster=All&var-datasource=thanos&var-site=eqiad&viewPanel=133 [08:25:52] MichaelG_WMF: okay, ty. do you mind filing it too? [08:26:00] !log urbanecm@deploy2002 urbanecm, migr: Continuing with sync [08:26:27] will do, after poking at it a bit more^^ [08:26:49] ty!
[08:27:53] !log installing unbound security updates [08:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:30:19] figured out the problem of missing thumbnails: I was missing the `CdxThumbnail` in the RL module definition in extension.json => filing tasks and then creating a fix [08:30:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [08:31:47] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:32:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:32:07] (03CR) 10Wangombe: "sq" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [08:32:36] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10374698 (10elukey) >>! In T380487#10373389, @KFrancis wrote: > Hi all, please send me Suzanne Wood's email address and I will process the NDA. Thanks! @KFrancis should be suzan... [08:32:52] MichaelG_WMF: ty! 
[08:33:00] i looked at instrumentation, seems to be working correctly too [08:34:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [08:35:10] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099690|Growth: enable temporary Surfacing Alpha on pilot wikis (T379976)]] (duration: 21m 30s) [08:35:12] T379976: Surfacing "Add a link" Structured Tasks: Alpha Release Plan and Release Task (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T379976 [08:35:16] should be live [08:35:29] (03CR) 10Arnaudb: [C:03+1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [08:37:55] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1091.eqiad.wmnet with OS bullseye [08:39:23] @urbanecm Thanks! [08:39:36] Task has been created for the bug that was noticed: [T381364 Surfacing Popups missing thumbnail image](https://phabricator.wikimedia.org/T381364) [08:39:36] T381364: Surfacing Popups missing thumbnail image - https://phabricator.wikimedia.org/T381364 [08:43:41] (03PS1) 10Jelto: add Anexia to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1100044 [08:45:44] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: parse2017.codfw.wmnet [08:45:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: parse2017.codfw.wmnet [08:47:55] (03CR) 10JMeybohm: [C:03+1] add Anexia to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1100044 (owner: 10Jelto) [08:49:07] (03CR) 10Vgutierrez: [C:03+1] add Anexia to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1100044 (owner: 10Jelto) [08:53:10] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:53:36] RECOVERY - haproxy failover on
dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:54:48] (03PS5) 10Arnaudb: alerts: enable paging mariadb through prometheus [alerts] - 10https://gerrit.wikimedia.org/r/1100042 (https://phabricator.wikimedia.org/T381276) [08:54:48] (03CR) 10Arnaudb: "please specially check the durations before pages. we don't want to page too soon, not too late either. its supposed to be escalating from" [alerts] - 10https://gerrit.wikimedia.org/r/1100042 (https://phabricator.wikimedia.org/T381276) (owner: 10Arnaudb) [08:56:28] (03CR) 10Jelto: [C:03+2] add Anexia to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1100044 (owner: 10Jelto) [09:06:00] (03PS1) 10Marostegui: dbproxy1020,dbproxy1026: Test new host [puppet] - 10https://gerrit.wikimedia.org/r/1100048 (https://phabricator.wikimedia.org/T381365) [09:06:54] (03CR) 10Marostegui: [C:03+2] dbproxy1020,dbproxy1026: Test new host [puppet] - 10https://gerrit.wikimedia.org/r/1100048 (https://phabricator.wikimedia.org/T381365) (owner: 10Marostegui) [09:08:59] (03PS1) 10Marostegui: Revert "dbproxy1020,dbproxy1026: Test new host" [puppet] - 10https://gerrit.wikimedia.org/r/1100049 [09:09:06] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1100049 (owner: 10Marostegui) [09:09:13] (03CR) 10Arnaudb: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [09:11:14] (03CR) 10Marostegui: "Tests are good" [puppet] - 10https://gerrit.wikimedia.org/r/1100049 (owner: 10Marostegui) [09:11:18] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1020,dbproxy1026: Test new host" [puppet] - 10https://gerrit.wikimedia.org/r/1100049 (owner: 10Marostegui) [09:15:05] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1099784 (https://phabricator.wikimedia.org/T381317) (owner: 10Cwhite) [09:17:12] (03CR) 10Marostegui: "Let's get someone from o11y to review this. I am not that used to prometheus paging." [alerts] - 10https://gerrit.wikimedia.org/r/1100042 (https://phabricator.wikimedia.org/T381276) (owner: 10Arnaudb) [09:21:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71492 and previous config saved to /var/cache/conftool/dbconfig/20241203-092122-root.json [09:22:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [09:22:02] (03CR) 10Tiziano Fogli: mariadb: add innodb buffer pool usage monitoring (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [09:24:47] !log removing ganeti1009 from active Ganeti nodes T378921 [09:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:49] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [09:25:12] (03PS1) 10Marostegui: mariadb: Add db2241 and db2242 in setup [puppet] - 10https://gerrit.wikimedia.org/r/1100050 (https://phabricator.wikimedia.org/T379757) [09:25:13] (03CR) 10Brouberol: [C:03+2] postgresql-airflow-wmde: add helmfiles and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099196 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [09:26:08] (03PS1) 10Muehlenhoff: ganeti1009: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1100051 [09:27:01] !log rebalance Ganeti eqiad/A following server refreshes [09:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:37] PROBLEM - ganeti-confd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 
(gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:27:37] PROBLEM - ganeti-noded running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:27:40] (03CR) 10Marostegui: [C:03+2] mariadb: Add db2241 and db2242 in setup [puppet] - 10https://gerrit.wikimedia.org/r/1100050 (https://phabricator.wikimedia.org/T379757) (owner: 10Marostegui) [09:28:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10374857 (10Marostegui) >>! In T379757#10372097, @Jhancock.wm wrote: > @Marostegui we got these servers in today. I'm gonna try to get them ready asap. I wasn'... [09:28:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10374858 (10Marostegui) [09:29:09] (03CR) 10Muehlenhoff: [C:03+2] ganeti1009: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1100051 (owner: 10Muehlenhoff) [09:29:13] FIRING: [13x] ProbeDown: Service ganeti1009:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:30:22] (03CR) 10Fabfur: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 (owner: 10Ssingh) [09:31:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:31:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:35:52] (03CR) 10Brouberol: [C:03+2] airflow-wmde: point to the cloudnative-pg cluster in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099197 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [09:36:28] !log marostegui@cumin1002 
dbctl commit (dc=all): 'es2041 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71493 and previous config saved to /var/cache/conftool/dbconfig/20241203-093627-root.json [09:36:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:38:06] (03PS1) 10Muehlenhoff: ganeti1009: Update site.pp even more [puppet] - 10https://gerrit.wikimedia.org/r/1100054 [09:38:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:40:19] !log homer 'cr*eqiad*' commit 'T377876' [09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:22] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [09:41:12] (03CR) 10Muehlenhoff: [C:03+2] ganeti1009: Update site.pp even more [puppet] - 10https://gerrit.wikimedia.org/r/1100054 (owner: 10Muehlenhoff) [09:50:42] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Connect - Orange, AS5511/IPv6: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:51:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71494 and previous config saved to /var/cache/conftool/dbconfig/20241203-095133-root.json [09:51:46] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:52:08] FIRING: [13x] ProbeDown: Service ganeti1009:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:54] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1006.eqiad.wmnet [09:52:55] !log jelto@cumin1002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1006.eqiad.wmnet [09:53:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:53:50] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10374959 (10Jelto) [09:53:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:54:38] (03CR) 10Marostegui: "This looks very useful, can you paste a run execution example? Thanks" [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb) [09:55:17] (03CR) 10Arnaudb: "adding @tfogli@wikimedia.org to the reviewers 😊" [alerts] - 10https://gerrit.wikimedia.org/r/1100042 (https://phabricator.wikimedia.org/T381276) (owner: 10Arnaudb) [09:57:25] (03CR) 10Arnaudb: "Sure, you got one in the attached task: https://phabricator.wikimedia.org/P71043" [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb) [09:57:37] (03PS1) 10Klausman: modules/admin: add sbisson to ML deployers on ml-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) [10:03:39] (03PS1) 10Slyngshede: Password update: avoid triggering invalid hash error [software/bitu] - 10https://gerrit.wikimedia.org/r/1100058 (https://phabricator.wikimedia.org/T381327) [10:04:23] (03CR) 10Marostegui: dbtools: command line helper to evaluate a host, or a group of hosts (034 comments) [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb) [10:04:45] (03PS1) 10Vgutierrez: hiera,haproxy: Set a bw limit per IP on upload@esams [puppet] - 10https://gerrit.wikimedia.org/r/1100059 [10:05:11] (03PS2) 10Vgutierrez: hiera,haproxy: Set a bw limit per IP on upload@esams [puppet] - 
10https://gerrit.wikimedia.org/r/1100059 [10:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71495 and previous config saved to /var/cache/conftool/dbconfig/20241203-100638-root.json [10:07:57] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100059 (owner: 10Vgutierrez) [10:09:18] (03PS3) 10Vgutierrez: hiera,haproxy: Set a bw limit per IP on upload@esams [puppet] - 10https://gerrit.wikimedia.org/r/1100059 [10:09:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100059 (owner: 10Vgutierrez) [10:15:27] (03CR) 10Giuseppe Lavagetto: [C:03+1] "The compiled output looks ok to me, of course I didn't run any tests." [puppet] - 10https://gerrit.wikimedia.org/r/1100059 (owner: 10Vgutierrez) [10:16:23] !log bking@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1002" [10:16:24] !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1027.eqiad.wmnet with OS bullseye [10:16:29] !log robh@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" [10:16:30] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7004.magru.wmnet with OS bookworm [10:16:36] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS boo... 
[10:16:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10375007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye completed: - wdqs1027... [10:18:10] (03CR) 10Fabfur: [C:03+1] hiera,haproxy: Set a bw limit per IP on upload@esams [puppet] - 10https://gerrit.wikimedia.org/r/1100059 (owner: 10Vgutierrez) [10:19:05] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [10:19:10] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [10:19:50] (03CR) 10Vgutierrez: [C:03+2] hiera,haproxy: Set a bw limit per IP on upload@esams [puppet] - 10https://gerrit.wikimedia.org/r/1100059 (owner: 10Vgutierrez) [10:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71496 and previous config saved to /var/cache/conftool/dbconfig/20241203-102143-root.json [10:26:06] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375032 (10Volans) FYI I have aborted the last reimage execution that was at the last step waiting for use input for the netbox-hiera int... 
[10:27:06] !log volans@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:27:16] (03PS1) 10Klausman: modules/admin: Add Ilias as approver for ml-lab users [puppet] - 10https://gerrit.wikimedia.org/r/1100063 [10:29:32] (03PS2) 10Klausman: modules/admin: Add Ilias as approver for k8s-ml deployers, ml-lab users & ml-team admins [puppet] - 10https://gerrit.wikimedia.org/r/1100063 [10:30:23] (03CR) 10CI reject: [V:04-1] modules/admin: Add Ilias as approver for k8s-ml deployers, ml-lab users & ml-team admins [puppet] - 10https://gerrit.wikimedia.org/r/1100063 (owner: 10Klausman) [10:30:24] !log volans@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1061.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:31:05] (03CR) 10Ilias Sarantopoulos: [C:03+1] modules/admin: add sbisson to ML deployers on ml-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) (owner: 10Klausman) [10:31:18] (03CR) 10Ilias Sarantopoulos: [C:03+1] modules/admin: Add Ilias as approver for k8s-ml deployers, ml-lab users & ml-team admins [puppet] - 10https://gerrit.wikimedia.org/r/1100063 (owner: 10Klausman) [10:31:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10375038 (10Volans) FYI I have aborted the last reimage execution that was at the last step waiting for use input for the netbox-hiera integration sync. Those chan... 
[10:31:41] (03PS3) 10Klausman: modules/admin: Add Ilias as approver for various ML-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1100063 [10:34:49] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Connect - Orange, AS5511/IPv6: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:51] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:36:38] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API, 13Patch-For-Review: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10375065 (10klausman) a:03klausman [10:39:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:42] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1019-1020].eqiad.wmnet [10:42:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1019-1020].eqiad.wmnet [10:47:32] (03PS1) 10Jelto: Rename kubernetes1019 and kubernetes1020 [puppet] - 10https://gerrit.wikimedia.org/r/1100069 (https://phabricator.wikimedia.org/T377876) [10:47:40] (03PS1) 10Elukey: Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) [10:49:05] (03PS1) 10Arnaudb: mariadb: db2239 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100068 [10:49:32] !log installed spicerack v9.0.0 on cumin[12]002 [10:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:55] (03CR) 10Volans: [V:03+2 C:03+2] sre.switchdc.databases: use mysql native methods 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [10:50:06] (03CR) 10Volans: [C:03+2] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [10:50:19] (03CR) 10Volans: [C:03+2] cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [10:52:18] (03CR) 10Elukey: "Follow up change for Kartotherian: https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/1100066" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [10:56:27] (03Merged) 10jenkins-bot: cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [10:57:25] (03CR) 10Marostegui: [C:04-1] "The yaml file also needs to be restored back to the original" [puppet] - 10https://gerrit.wikimedia.org/r/1100068 (owner: 10Arnaudb) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1100) [11:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:03:54] (03PS2) 10Arnaudb: mariadb: db2239 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100068 [11:04:11] (03CR) 10Arnaudb: "removes hieradata content" [puppet] - 10https://gerrit.wikimedia.org/r/1100068 (owner: 10Arnaudb) [11:04:24] (03CR) 10Marostegui: [C:03+1] mariadb: 
db2239 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100068 (owner: 10Arnaudb) [11:04:50] (03CR) 10Marostegui: [C:03+1] "yes, that one." [puppet] - 10https://gerrit.wikimedia.org/r/1100068 (owner: 10Arnaudb) [11:05:08] (03CR) 10Arnaudb: [C:03+2] mariadb: db2239 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100068 (owner: 10Arnaudb) [11:16:24] (03PS1) 10Jelto: wikidata-query-gui: bump image version after merging gerrit remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100074 (https://phabricator.wikimedia.org/T350793) [11:18:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [11:20:03] (03PS1) 10Cathal Mooney: Block PAWS workers nodes from all UDP traffic other than DNS & NTP [puppet] - 10https://gerrit.wikimedia.org/r/1100077 (https://phabricator.wikimedia.org/T381078) [11:20:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2025 to clone es2046', diff saved to https://phabricator.wikimedia.org/P71497 and previous config saved to /var/cache/conftool/dbconfig/20241203-112015-marostegui.json [11:20:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2025.codfw.wmnet with reason: cloning [11:20:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2025.codfw.wmnet with reason: cloning [11:22:59] (03PS1) 10Marostegui: mariadb: Productionize es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1100079 (https://phabricator.wikimedia.org/T381259) [11:24:15] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1100079 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [11:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10375159 (10phaultfinder) [11:26:20] (03PS1) 10Marostegui: 
es2041: Host in production [puppet] - 10https://gerrit.wikimedia.org/r/1100081 [11:27:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1100077 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:27:45] (03CR) 10Marostegui: [C:03+2] es2041: Host in production [puppet] - 10https://gerrit.wikimedia.org/r/1100081 (owner: 10Marostegui) [11:27:55] (03CR) 10FNegri: [C:03+1] "This looks good as a stopgap, I'm researching how to properly filter outbound traffic from the PAWS k8s cluster." [puppet] - 10https://gerrit.wikimedia.org/r/1100077 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:29:14] (03CR) 10Cathal Mooney: [C:03+2] Block PAWS workers nodes from all UDP traffic other than DNS & NTP [puppet] - 10https://gerrit.wikimedia.org/r/1100077 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:30:07] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes1019 and kubernetes1020 [puppet] - 10https://gerrit.wikimedia.org/r/1100069 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [11:31:51] !log pushing new nftables rules to cloudgw1001 to block abuse from paws T381078 [11:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:54] T381078: cloudgw: suspected network problems - https://phabricator.wikimedia.org/T381078 [11:32:17] (03CR) 10Jelto: [C:03+2] Rename kubernetes1019 and kubernetes1020 [puppet] - 10https://gerrit.wikimedia.org/r/1100069 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [11:32:23] !log volans@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:32:50] (03PS1) 10Arthur taylor: Remove EntitySchema DataType feature flag - is always enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) [11:33:05] !log volans@cumin1002 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1061.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:36:53] (03PS1) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) [11:37:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:17] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1019 to wikikube-worker1015 [11:37:38] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:39:39] (03PS1) 10Cathal Mooney: Fix syntax errors in nft rules [puppet] - 10https://gerrit.wikimedia.org/r/1100087 (https://phabricator.wikimedia.org/T381078) [11:40:44] (03CR) 10FNegri: [C:03+1] Fix syntax errors in nft rules [puppet] - 10https://gerrit.wikimedia.org/r/1100087 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:41:41] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1019 to wikikube-worker1015 - jelto@cumin1002" [11:42:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1019 to wikikube-worker1015 - jelto@cumin1002" [11:42:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:17] 
!log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1015 [11:42:36] (03CR) 10Cathal Mooney: [C:03+2] Fix syntax errors in nft rules [puppet] - 10https://gerrit.wikimedia.org/r/1100087 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:42:38] (03CR) 10FNegri: [C:03+1] "> I'm researching how to properly filter outbound traffic from the PAWS k8s cluster." [puppet] - 10https://gerrit.wikimedia.org/r/1100077 (https://phabricator.wikimedia.org/T381078) (owner: 10Cathal Mooney) [11:43:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1015 [11:44:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1019 to wikikube-worker1015 [11:44:53] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1020 to wikikube-worker1016 [11:45:13] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:48:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1100058 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [11:49:16] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1020 to wikikube-worker1016 - jelto@cumin1002" [11:49:37] (03CR) 10Slyngshede: [C:03+2] Password update: avoid triggering invalid hash error [software/bitu] - 10https://gerrit.wikimedia.org/r/1100058 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [11:49:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1020 to wikikube-worker1016 - jelto@cumin1002" [11:49:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:43] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1016 [11:50:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1016 [11:50:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1020 to wikikube-worker1016 [11:51:25] (03CR) 10Muehlenhoff: "This only needs approval by an existing approver, IOW Chris and can then be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1100063 (owner: 10Klausman) [11:51:51] (03Merged) 10jenkins-bot: Password update: avoid triggering invalid hash error [software/bitu] - 10https://gerrit.wikimedia.org/r/1100058 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [11:53:08] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache kubernetes1019.eqiad.wmnet wikikube-worker1015.eqiad.wmnet kubernetes1020.eqiad.wmnet wikikube-worker1016.eqiad.wmnet on all recursors [11:53:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubernetes1019.eqiad.wmnet 
wikikube-worker1015.eqiad.wmnet kubernetes1020.eqiad.wmnet wikikube-worker1016.eqiad.wmnet on all recursors [11:54:25] (03PS2) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [11:58:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1015.eqiad.wmnet with OS bookworm [12:00:00] (03CR) 10Gmodena: [C:03+1] rdf-streaming-updater: add wdqs udpater streams in event stream config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [12:00:32] (03PS1) 10Marostegui: mariadb: Add db125(5,6) to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100091 (https://phabricator.wikimedia.org/T379753) [12:02:55] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1100091 (https://phabricator.wikimedia.org/T379753) (owner: 10Marostegui) [12:03:17] (03CR) 10Marostegui: [C:03+2] mariadb: Add db125(5,6) to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100091 (https://phabricator.wikimedia.org/T379753) (owner: 10Marostegui) [12:05:54] (03PS3) 10Muehlenhoff: Assign builder role to build2002 (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) [12:07:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10375308 (10Marostegui) a:05Marostegui→03None >>! In T379753#10316745, @RobH wrote: > @Marostegui, > > Please note the workflow for racking tasks has chan... 
[12:07:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10375311 (10Marostegui) [12:08:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [12:30:25] (03CR) 10Marostegui: [C:03+2] mariadb: Set db125[0-4] insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100096 (https://phabricator.wikimedia.org/T380083) (owner: 10Marostegui) [12:30:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10375377 (10Marostegui) a:05Marostegui→03None >>! In T380083#10327817, @RobH wrote: > Please note the workflow for racking tasks has changed this fiscal y... [12:31:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10375382 (10Marostegui) [12:35:25] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [12:36:20] (03PS21) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [12:36:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1015.eqiad.wmnet with OS bookworm [12:37:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1016.eqiad.wmnet with OS bookworm [12:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10375402 (10phaultfinder) [12:42:49] (03PS1) 10Gerrit maintenance bot: Add tig to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1100097 (https://phabricator.wikimedia.org/T381377) [12:43:31] jouncebot: nowandnext [12:43:31] No deployments scheduled for the next 0 hour(s) and 16 minute(s) 
[12:43:31] In 0 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1300) [12:46:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10375487 (10MoritzMuehlenhoff) [12:47:59] !log klausman@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ml-lab1001.eqiad.wmnet [12:49:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10375503 (10MoritzMuehlenhoff) [12:53:06] !log jnuche@deploy2002 Installing scap version "4.132.0" for 1 host(s) [12:54:03] !log jnuche@deploy2002 Installation of scap version "4.132.0" completed for 1 hosts [12:54:12] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1016.eqiad.wmnet with reason: host reimage [12:55:16] (03PS3) 10Máté Szabó: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) [12:55:17] (03PS1) 10Máté Szabó: Prep IRS config for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100101 [12:55:29] !log jnuche@deploy2002 Installing scap version "4.132.0" for 1 host(s) [12:56:18] !log jnuche@deploy2002 Installation of scap version "4.132.0" completed for 1 hosts [12:57:05] !log jnuche@deploy2002 Installing scap version "4.132.0" for 207 host(s) [12:57:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1016.eqiad.wmnet with reason: host reimage [12:59:01] (03CR) 10Effie Mouzeli: [C:03+1] "Looks alright! 
One nit, I think we should add in the commit message that at the moment, it is not possible to package mapnik (as it has al" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [13:00:01] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1300) [13:04:12] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1012.eqiad.wmnet [13:06:02] !log jnuche@deploy2002 Installing scap version "4.132.0" for 1 host(s) [13:06:59] !log jnuche@deploy2002 Installation of scap version "4.132.0" completed for 1 hosts [13:07:44] (03PS1) 10Michael Große: fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) [13:08:04] (03PS1) 10Michael Große: fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) [13:08:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [13:09:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 
(https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [13:09:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:10:00] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:10:01] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:10:14] (03CR) 10Jelto: [C:03+1] wikidata-query-gui: bump image version after merging gerrit remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100074 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:10:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:10:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] wikidata-query-gui: bump image version after merging gerrit remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100074 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:10:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:02] (03Merged) 10jenkins-bot: wikidata-query-gui: bump image version after merging gerrit remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100074 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:13:03] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:13:03] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:13:40] (03CR) 10Ladsgroup: [C:03+2] Add tig to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1100097 (https://phabricator.wikimedia.org/T381377) (owner: 10Gerrit maintenance bot) [13:13:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:14:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:14:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:14:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1012.eqiad.wmnet [13:14:57] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1022.eqiad.wmnet [13:15:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1016.eqiad.wmnet with OS bookworm [13:18:44] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:18:44] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:18:53] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:19:16] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:19:41] 06SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile HDFS analytics traffic - https://phabricator.wikimedia.org/T381389 (10cmooney) 03NEW p:05Triage→03Medium [13:19:52] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:19:52] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:20:20] 06SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375639 (10cmooney) [13:20:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:20:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:21:19] 06SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10375643 (10cmooney) [13:21:39] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:22:25] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [13:22:42] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [13:22:49] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [13:23:17] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [13:23:52] (03CR) 10CI reject: [V:04-1] fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [13:24:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:25:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag 
one - jmm@cumin2002" [13:25:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:25:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1022.eqiad.wmnet [13:26:08] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission ganeti1012 / ganeti1022 - https://phabricator.wikimedia.org/T381385#10375656 (10MoritzMuehlenhoff) [13:26:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [13:27:17] (03CR) 10CI reject: [V:04-1] fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [13:28:35] !log upgrade haproxykafka to version 0.3.4 (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/commits/main?ref_type=heads) (T380583) [13:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:38] T380583: Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583 [13:29:39] (03CR) 10Vgutierrez: [C:03+1] "Oh! in this CR or in a following one please remove `,regsub('^ECDHE-RSA-','')` from common/profile/cache/haproxy.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [13:30:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:30:15] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:32:23] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:32:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.restart_sanitarium (exit_code=0) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:33:32] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:33:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.restart_sanitarium (exit_code=0) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:34:24] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:34:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.restart_sanitarium (exit_code=0) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:34:55] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:35:06] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:38:58] (03CR) 10Lucas Werkmeister (WMDE): trafficserver: switch query-scholarly to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:39:15] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10375702 (10RobH) >>! In T380307#10375032, @Volans wrote: > FYI I have aborted the last reimage execution that was at the last step waitin... [13:39:16] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. 
[13:39:26] (03CR) 10Jelto: [C:03+2] trafficserver: switch query-scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:39:27] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:40:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:40:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.restart_sanitarium (exit_code=0) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:41:00] !log arnaudb@cumin1002 START - Cookbook sre.mysql.restart_sanitarium Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:41:11] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.restart_sanitarium (exit_code=99) Restart a pool of Sanitarium MariaDB instances and/or hosts. [13:41:17] (03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [13:42:02] (03PS2) 10Anzx: knwiki: remove module namespace names from core-Namespaces.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100110 (https://phabricator.wikimedia.org/T346583) [13:42:15] (03PS1) 10Alexandros Kosiaris: gateway-check: Make indentation consistent [puppet] - 10https://gerrit.wikimedia.org/r/1100111 [13:42:15] (03PS1) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [13:43:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100110 (https://phabricator.wikimedia.org/T346583) (owner: 10Anzx) [13:44:48] 
!incidents [13:44:48] 5504 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [13:44:48] 5503 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:44:49] 5502 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [13:44:49] 5501 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [13:44:49] 5500 (RESOLVED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs) [13:44:56] got paged again by kafka-main1003 [13:44:59] me too [13:45:03] downtime expired? [13:45:06] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:45:08] !incidents [13:45:09] 5504 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [13:45:09] 5503 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:45:09] 5502 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [13:45:09] 5501 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [13:45:09] 5500 (RESOLVED) ProbeDown sre (185.15.58.225 ip4 text-https:443 probes/service http_text-https_ip4 drmrs) [13:45:17] but there's no incident :) [13:45:19] !rack 5498 [13:45:21] !ack 5498 [13:45:21] 5498 (ACKED) kafka-main1003/Kafka Broker Server (paged) [13:45:31] it's too old to appear there jayme [13:45:35] ah... [13:45:37] hello [13:45:42] effie: ^^ [13:45:51] effie: you owe another 🍺 to us, now you need to include sukhe as well [13:45:53] why again, why [13:46:09] should we mark this as resolved then :) [13:46:11] effie: do you want to get it flagged as resolved on Splunk? [13:46:17] yeah...
let's try that bugfix [13:46:21] !resolve 5498 [13:46:21] +1 [13:46:21] 5498 (RESOLVED) kafka-main1003/Kafka Broker Server (paged) [13:46:25] lovely [13:46:30] (03PS2) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [13:46:48] we should have kept silent and continued increasing the beer counter though [13:46:51] ok I thought it pa-ged again, so I was very puzzled [13:46:55] phew [13:47:11] I owe you nothing this time [13:47:44] I disagree :) [13:47:49] 🍻 [13:48:44] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:50:02] (03PS1) 10Fabfur: cache:haproxy: Accept-Language header size to 96 [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [13:51:28] (03CR) 10Ssingh: [C:03+1] "Looks good!" [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) (owner: 10Fabfur) [13:52:08] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:18] (03CR) 10Fabfur: [C:04-2] "Thanks, putting on hold until Dec 11 @11.00 UTC" [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) (owner: 10Fabfur) [13:54:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:53] !log homer 'cr*eqiad*' commit 'T377876'
[13:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:55] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [13:58:24] (03PS3) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1400). [14:00:06] chlod, MichaelG_WMF, MatmaRex, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] hi [14:00:25] o/ [14:00:26] o/ [14:00:28] o/ [14:00:34] o/ [14:00:43] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:00:46] (03PS4) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:00:47] o/ [14:00:51] my patch is a no-op, it can't really be tested in mwdebug [14:00:51] i can deploy today [14:01:07] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4622/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) (owner: 10CDobbins) [14:01:21] (03CR) 10Urbanecm: [C:03+2] fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:01:24] (03CR) 10Urbanecm: [C:03+2] fix: show thumbnails in surfacing popups 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:01:45] (03PS1) 10Arnaudb: mysql: add unix_socket [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [14:02:02] (03PS1) 10Kosta Harlan: dialog: Don't duplicate the footer in the behaviour list template [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) [14:02:06] (03CR) 10Urbanecm: [C:03+2] Increase Nuke max age to 90 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [14:02:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) (owner: 10Kosta Harlan) [14:02:52] (03CR) 10Urbanecm: [C:03+2] "Scribunto change is in both wmf.5 and wmf.6, should be a no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100110 (https://phabricator.wikimedia.org/T346583) (owner: 10Anzx) [14:03:02] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:03:05] (03Merged) 10jenkins-bot: Increase Nuke max age to 90 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [14:03:41] (03CR) 10Urbanecm: [C:03+2] Remove temporary fix for badly set CentralAuth cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [14:03:43] (03Merged) 10jenkins-bot: knwiki: remove module namespace names 
from core-Namespaces.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100110 (https://phabricator.wikimedia.org/T346583) (owner: 10Anzx) [14:03:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [14:04:30] (03Merged) 10jenkins-bot: Remove temporary fix for badly set CentralAuth cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [14:04:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10375775 (10phaultfinder) [14:05:04] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098574|Increase Nuke max age to 90 days (T380846)]], [[gerrit:1100110|knwiki: remove module namespace names from core-Namespaces.php (T346583)]], [[gerrit:1093497|Remove temporary fix for badly set CentralAuth cookies]] [14:05:10] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [14:05:10] T346583: Change namespace names for Kannada Language - https://phabricator.wikimedia.org/T346583 [14:06:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:08:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10375789 (10Jhancock.wm) all good! i was off all last week too. thank you. 
[14:11:37] !log urbanecm@deploy2002 matmarex, chlod, urbanecm, anzx: Backport for [[gerrit:1098574|Increase Nuke max age to 90 days (T380846)]], [[gerrit:1100110|knwiki: remove module namespace names from core-Namespaces.php (T346583)]], [[gerrit:1093497|Remove temporary fix for badly set CentralAuth cookies]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:41] (03PS3) 10CDobbins: lvs: Deploy node_ferm_mss exporter on ferm based realservers [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) [14:11:41] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [14:11:41] T346583: Change namespace names for Kannada Language - https://phabricator.wikimedia.org/T346583 [14:11:45] (03PS5) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:12:11] chlod: anzx: can you check your patches at mwdebug, please? 
[14:12:20] MatmaRex: i saw you mentioned your patch is no op, so i assume you're good to go [14:12:50] urbanecm: nothing to check on mine [14:12:56] ack [14:13:00] mine's working good :) [14:13:11] good [14:13:12] !log urbanecm@deploy2002 matmarex, chlod, urbanecm, anzx: Continuing with sync [14:13:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1015-1016].eqiad.wmnet [14:13:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1015-1016].eqiad.wmnet [14:13:42] yep [14:14:14] (03CR) 10Ssingh: [C:03+1] wdqs-internal: codfw pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [14:14:15] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10375822 (10Jelto) [14:14:34] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:14:47] brb door [14:14:51] (03CR) 10Ssingh: "You can certainly allocate the IP in Netbox in advance so I will wait for that before reviewing the updated the patch." [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [14:15:00] (03CR) 10Vgutierrez: [C:03+1] lvs: Deploy node_ferm_mss exporter on ferm based realservers [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) (owner: 10CDobbins) [14:15:14] (03CR) 10Volans: [C:04-1] "Nice, good direction, needs a couple of tweaks." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [14:15:19] re [14:16:23] (03CR) 10Ssingh: [C:03+1] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [14:16:34] (03PS6) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:16:41] (03CR) 10Ssingh: [C:03+1] wdqs-internal: configure graphsplit load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [14:17:03] (03PS2) 10Arnaudb: mysql: add unix_socket [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [14:17:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2242.codfw.wmnet with OS bookworm [14:17:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2241.codfw.wmnet with OS bookworm [14:17:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10375830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2242.codfw.wmnet with OS b... [14:17:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10375831 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2241.codfw.wmnet with OS b... 
[14:18:48] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:20:23] (03PS7) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:21:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10375835 (10WMDECyn) Approved from WMDE side [14:21:43] (03CR) 10Arnaudb: "thanks, tweaks tweaked!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [14:22:08] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098574|Increase Nuke max age to 90 days (T380846)]], [[gerrit:1100110|knwiki: remove module namespace names from core-Namespaces.php (T346583)]], [[gerrit:1093497|Remove temporary fix for badly set CentralAuth cookies]] (duration: 17m 04s) [14:22:12] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [14:22:12] T346583: Change namespace names for Kannada Language - https://phabricator.wikimedia.org/T346583 [14:22:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:22:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:22:42] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) 
(owner: 10Alexandros Kosiaris) [14:22:46] chlod: MatmaRex: anzx: deployed! [14:22:49] urbanecm: thanks for the deployment, would it be possible to run namespacedupes for knwiki, knwikisource, knwikiquote and knwiktionary [14:22:52] sure [14:22:52] thanks [14:23:13] thanks, urbanecm! :D [14:24:10] anzx: none of the wikis report any rows to fix. is that intended? [14:24:18] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [14:24:26] (03Merged) 10jenkins-bot: fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100102 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:24:33] (03Merged) 10jenkins-bot: fix: show thumbnails in surfacing popups [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100103 (https://phabricator.wikimedia.org/T381364) (owner: 10Michael Große) [14:24:44] urbanecm: yes, maybe because the module talk namespace is used less [14:24:58] anzx: +your patch doesn't really change anything, right? [14:25:04] yes [14:25:06] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1100103|fix: show thumbnails in surfacing popups (T381364)]], [[gerrit:1100102|fix: show thumbnails in surfacing popups (T381364)]] [14:25:09] T381364: Surfacing Popups missing thumbnail image - https://phabricator.wikimedia.org/T381364 [14:26:02] (03CR) 10DCausse: [C:03+1] "lgtm, couple nits (cleanups)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [14:27:43] (03CR) 10Volans: [C:04-1] "Sorry got over it too quickly... we're connecting remotely, socket won't work" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [14:29:18] (03CR) 10Arnaudb: "oh good catch!
I'll come with a suggestion for port logic" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [14:30:46] !log urbanecm@deploy2002 migr, urbanecm: Backport for [[gerrit:1100103|fix: show thumbnails in surfacing popups (T381364)]], [[gerrit:1100102|fix: show thumbnails in surfacing popups (T381364)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:53] T381364: Surfacing Popups missing thumbnail image - https://phabricator.wikimedia.org/T381364 [14:31:01] (03PS2) 10Harroyo-wmf: dialog: Don't duplicate the footer in the behaviour list template [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) (owner: 10Kosta Harlan) [14:31:10] MichaelG_WMF: can you test in production? [14:31:13] *mwdebug [14:31:17] urbanecm: that's both wmf.5 and wmf.6 right? [14:31:25] MichaelG_WMF: correct [14:32:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2242.codfw.wmnet with reason: host reimage [14:32:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2241.codfw.wmnet with reason: host reimage [14:32:37] i got a JS error at eswiki, but i cannot reproduce it [14:32:43] this was in the console https://www.irccloud.com/pastebin/ECYVN5u2/ [14:33:09] urbanecm: for me it works in all pilot wikis [14:33:29] MichaelG_WMF: yeah, i see the image, but i got the error on a first load [14:33:46] (03PS8) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [14:33:54] possibly transient, now it looks good [14:34:10] urbanecm: I see no error with Firefox, I can try with Chromium [14:34:47] ah, but I don't have the wmfdebug extension there [14:35:15] given i can't get it to happen again, i'm inclined to go ahead...
[14:35:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2242.codfw.wmnet with reason: host reimage [14:36:05] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:36:42] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:25] urbanecm: works for me in Chromium as well. Let's go ahead [14:37:43] yep, agreed. tried a fresh new session, in case it relates to browser caching, still works fine [14:37:45] !log urbanecm@deploy2002 migr, urbanecm: Continuing with sync [14:37:47] proceeding [14:38:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2241.codfw.wmnet with reason: host reimage [14:39:13] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10375904 (10Andrew) This has been recurring for some time (e.g. T368211) so probably needs DC attention. @Jhancock.wm, it's OK to power down this... 
[14:41:07] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10375913 (10fnegri) More previous occurrences: * {T368212} * {T370732} [14:44:30] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100103|fix: show thumbnails in surfacing popups (T381364)]], [[gerrit:1100102|fix: show thumbnails in surfacing popups (T381364)]] (duration: 19m 24s) [14:44:33] T381364: Surfacing Popups missing thumbnail image - https://phabricator.wikimedia.org/T381364 [14:52:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:54:32] (03PS6) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [14:55:10] (03PS1) 10Muehlenhoff: Fix typo in SUL reminder [software/bitu] - 10https://gerrit.wikimedia.org/r/1100132 [14:55:10] (03PS1) 10Muehlenhoff: Extend access request email template [software/bitu] - 10https://gerrit.wikimedia.org/r/1100133 [14:57:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:00:05] (03CR) 10CI reject: [V:04-1] mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [15:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:00] (03PS7) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 
(https://phabricator.wikimedia.org/T381086) [15:01:42] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099790 (https://phabricator.wikimedia.org/T380778) (owner: 10Jdrewniak) [15:06:14] (03PS2) 10Jdrewniak: Rerunning Web browser extension survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099790 (https://phabricator.wikimedia.org/T380778) [15:07:47] (03CR) 10Calbon: "> This only needs approval by an existing approver, IOW Chris and can then be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1100063 (owner: 10Klausman) [15:08:20] (03CR) 10Calbon: "approved!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) (owner: 10Klausman) [15:08:22] (03CR) 10Harroyo-wmf: [C:03+1] dialog: Don't duplicate the footer in the behaviour list template [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) (owner: 10Kosta Harlan) [15:08:35] (03CR) 10Klausman: [C:03+2] modules/admin: Add Ilias as approver for various ML-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1100063 (owner: 10Klausman) [15:08:41] (03CR) 10Calbon: [V:03+1] modules/admin: add sbisson to ML deployers on ml-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) (owner: 10Klausman) [15:09:48] (03PS2) 10Klausman: modules/admin: add sbisson to ML deployers on ml-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) [15:09:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:09:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2241.codfw.wmnet with OS bookworm [15:10:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:10:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2242.codfw.wmnet with OS bookworm [15:10:04] (03CR) 10Klausman: [V:03+2 C:03+2] modules/admin: add sbisson to ML deployers on ml-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1100057 (https://phabricator.wikimedia.org/T381108) (owner: 10Klausman) [15:10:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es204[1-6] - 
https://phabricator.wikimedia.org/T378146#10375989 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2241.codfw.wmnet with OS bookw... [15:10:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10375990 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2242.codfw.wmnet with OS bookw... [15:10:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [15:11:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10375992 (10Jhancock.wm) [15:11:07] (03PS2) 10Bking: wdqs-internal: codfw pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [15:11:19] (03CR) 10CI reject: [V:04-1] mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [15:11:30] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API, 13Patch-For-Review: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10375993 (10klausman) This should work now. Stephane, if you could verify that it does and then resolve... 
[15:12:04] (03PS3) 10Bking: wdqs-internal-[main|scholarly]: pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [15:12:41] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10375996 (10Jhancock.wm) 05Open→03Resolved a:05Marostegui→03Jhancock.wm i copy pasta-ed the cookbook runs to the wrong ticket, but these are installed and ready to go! [15:12:52] (03PS1) 10Jelto: Rename kubernetes1021 and kubernetes1022 [puppet] - 10https://gerrit.wikimedia.org/r/1100136 (https://phabricator.wikimedia.org/T377876) [15:13:07] (03CR) 10Arnaudb: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [15:13:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1021-1022].eqiad.wmnet [15:14:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1021-1022].eqiad.wmnet [15:16:43] (03PS1) 10Vgutierrez: hiera: Extend bwlimit to upload cluster globally [puppet] - 10https://gerrit.wikimedia.org/r/1100137 [15:17:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100137 (owner: 10Vgutierrez) [15:21:19] (03CR) 10CDanis: [C:03+1] hiera: Extend bwlimit to upload cluster globally [puppet] - 10https://gerrit.wikimedia.org/r/1100137 (owner: 10Vgutierrez) [15:22:03] (03PS1) 10LorenMora: Deploy Vector22 To Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100138 (https://phabricator.wikimedia.org/T381041) [15:22:24] (03PS8) 10Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) [15:23:05] (03PS8) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [15:24:01] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10376045 (10MSantos) We need the same for @Atieno do we need to create a new task or use the same task suffice? [15:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376060 (10phaultfinder) [15:25:14] (03PS2) 10Alexandros Kosiaris: gateway-check: Make indentation consistent [puppet] - 10https://gerrit.wikimedia.org/r/1100111 [15:25:15] (03PS9) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [15:26:00] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes1021 and kubernetes1022 [puppet] - 10https://gerrit.wikimedia.org/r/1100136 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [15:26:17] 10ops-codfw, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10376063 (10Jhancock.wm) Now that you mention it, I think this might be a PDU issue rather than a server issue. Looking back through the tickets w... [15:28:06] (03CR) 10Jelto: [C:03+2] Rename kubernetes1021 and kubernetes1022 [puppet] - 10https://gerrit.wikimedia.org/r/1100136 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [15:28:07] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [15:28:16] (03PS2) 10Elukey: Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) [15:28:28] (03CR) 10Elukey: "Makes sense! 
added :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [15:28:28] (03PS9) 10Bking: wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [15:29:29] (03CR) 10Elukey: [V:03+2 C:03+2] Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100070 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [15:31:13] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1021 to wikikube-worker1034 [15:31:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [15:31:38] (03PS3) 10Alexandros Kosiaris: gateway-check: Make indentation consistent [puppet] - 10https://gerrit.wikimedia.org/r/1100111 [15:31:38] (03PS10) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [15:31:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:43] (03CR) 10CI reject: [V:04-1] mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [15:34:28] (03CR) 10CI reject: [V:04-1] gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 
(https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [15:35:38] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1021 to wikikube-worker1034 - jelto@cumin1002" [15:36:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1021 to wikikube-worker1034 - jelto@cumin1002" [15:36:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:18] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1034 [15:37:32] (03PS9) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [15:37:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1034 [15:37:58] (03PS10) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [15:38:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1021 to wikikube-worker1034 [15:38:59] (03PS11) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [15:39:11] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1022 to wikikube-worker1035 [15:39:18] (03PS1) 10Elukey: Fix changelog warnings related to spark3.3 and jaeger [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100141 [15:39:31] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [15:39:55] (03PS2) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 
(https://phabricator.wikimedia.org/T371141) [15:40:12] (03PS2) 10Elukey: Fix changelog warnings related to spark3.3 and jaeger [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100141 [15:40:28] (03PS1) 10Chlod Alejandro: Revert "Increase Nuke max age to 90 days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 [15:41:32] (03PS2) 10Chlod Alejandro: Revert "Increase Nuke max age to 90 days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) [15:42:04] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10376117 (10elukey) Better to open a new one for traceability! [15:42:09] (03CR) 10Btullis: [C:03+1] "Thank you." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100141 (owner: 10Elukey) [15:42:49] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394 (10RobH) 03NEW [15:43:08] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10376139 (10RobH) [15:43:18] (03PS12) 10Alexandros Kosiaris: gateway-check: Support per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [15:43:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1022 to wikikube-worker1035 - jelto@cumin1002" [15:44:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [15:45:17] 
(03PS3) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333) [15:45:38] (03CR) 10CI reject: [V:04-1] wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [15:45:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1022 to wikikube-worker1035 - jelto@cumin1002" [15:45:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:46] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1035 [15:47:07] (03PS1) 10Muehlenhoff: graphite: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100144 [15:47:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1035 [15:48:34] (03CR) 10Jdrewniak: [C:03+1] Deploy Vector22 To Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100138 (https://phabricator.wikimedia.org/T381041) (owner: 10LorenMora) [15:48:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1022 to wikikube-worker1035 [15:48:56] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1034.eqiad.wmnet wikikube-worker1035.eqiad.wmnet on all recursors [15:48:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1034.eqiad.wmnet wikikube-worker1035.eqiad.wmnet on all recursors [15:49:28] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T381174#10376196 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:49:33] (03CR) 10BCornwall: [V:03+1 C:03+2] icinga: Remove RSA cert 
monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [15:49:35] (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [15:50:39] (03PS2) 10Muehlenhoff: graphite: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100144 [15:51:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1034.eqiad.wmnet with OS bookworm [15:52:14] (03PS4) 10Bking: wdqs-internal-[main|scholarly]: pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [15:52:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [15:53:21] (03PS5) 10Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) [15:53:36] (03PS10) 10Bking: wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [15:53:39] (03CR) 10CI reject: [V:04-1] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [15:54:40] (03PS13) 10Alexandros Kosiaris: gateway-check: Support (and use) per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [15:55:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10376270 (10Jhancock.wm) [15:56:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100138 (https://phabricator.wikimedia.org/T381041) (owner: 10LorenMora) [15:57:33] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376308 (10SBisson) >>! In T381108#10375993, @klausman wrote: > This should work now. Stephane, if you could verify that it d... [16:00:05] eoghan, jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1600). [16:00:20] (03CR) 10Kgraessle: [C:03+1] Revert "Increase Nuke max age to 90 days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:02:06] (03CR) 10Hnowlan: [C:03+1] gateway-check: Support (and use) per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [16:04:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376348 (10phaultfinder) [16:05:51] (03PS1) 10Muehlenhoff: releases: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100146 [16:07:05] !log installing intel-microcode security updates [16:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:09] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376383 (10isarantopoulos) You could verify that you can deploy recapi in ml-staging-codfw check if there is any diff ` cd /... 
[16:08:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100146 (owner: 10Muehlenhoff) [16:09:29] jouncebot: nowandnext [16:09:29] For the next 0 hour(s) and 50 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1600) [16:09:29] In 0 hour(s) and 50 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1700) [16:12:58] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376400 (10Gehel) [16:16:31] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408 (10Atieno) 03NEW [16:17:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Should be okay to deploy once the change linked in Depends-On is fully rolled out with the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor) [16:19:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376422 (10phaultfinder) [16:19:53] !log rebalance Ganeti eqiad/B following server refreshes [16:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:23] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10376430 (10MSantos) As the Product Manager responsible for the MediaWiki Release process, I approve this request. 
[16:22:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376448 (10phaultfinder) [16:26:52] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10376426 (10MSantos) Got it! Done in {T381408}. [16:27:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:27:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:27:55] RESOLVED: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:29:04] (03PS5) 10Ryan Kemper: wdqs-internal: pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) [16:29:04] (03PS11) 10Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) [16:29:04] (03PS4) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333) [16:29:05] (03PS6) 10Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) [16:29:05] (03PS5) 10Ryan Kemper: wdqs-internal: configure graphsplit 
load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) [16:29:08] (03PS5) 10Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) [16:30:33] !log Disabling puppet on A:cp to prep for RSA removal - T370837 [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:36] T370837: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837 [16:34:30] (03CR) 10Ssingh: [C:03+1] wdqs-internal: pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [16:34:40] (03PS1) 10Atieno: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) [16:35:39] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10376510 (10BPirkle) As acting Engineering Manager for MediaWiki Interfaces, I approve this request. 
[16:35:59] (03CR) 10Ssingh: [C:03+2] sre.roll-restart-reboot-wikimedia-dns: update aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 (owner: 10Ssingh) [16:36:32] jouncebot: now [16:36:32] For the next 0 hour(s) and 23 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1600) [16:36:57] (03CR) 10Urbanecm: [C:03+2] Revert "Increase Nuke max age to 90 days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:37:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:37:56] (03Merged) 10jenkins-bot: Revert "Increase Nuke max age to 90 days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100142 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:38:05] (03CR) 10MSantos: [C:03+1] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) (owner: 10Atieno) [16:38:23] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1100142|Revert "Increase Nuke max age to 90 days" (T380846)]] [16:38:26] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [16:41:34] (03PS22) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [16:42:06] (03CR) 10Herron: [C:03+1] Fix changelog warnings related to spark3.3 and jaeger [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100141 (owner: 10Elukey) [16:42:48] (03CR) 10Ssingh: "Note that a manual DNS entry for the above service IPs is still required." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:43:09] (03PS2) 10Hnowlan: jobqueue: disable webVideoTranscodePrioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) [16:43:15] (03CR) 10CI reject: [V:04-1] mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:44:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:44:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:45:50] (03PS23) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [16:46:58] w [16:47:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:47:55] (03PS1) 10Tchanders: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) [16:49:00] (03CR) 10Ssingh: [C:03+1] wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) (owner: 10Ryan Kemper) [16:49:37] (03CR) 10Ssingh: "Ignore the previous comment, that was my bad as I didn't see the other patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:49:47] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1034.eqiad.wmnet with OS bookworm [16:50:43] (03CR) 10Ssingh: [C:03+1] wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:50:53] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100142|Revert "Increase Nuke max age to 90 days" (T380846)]] (duration: 12m 29s) [16:50:57] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [16:51:00] (03CR) 10Dreamy Jazz: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [16:51:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1034.eqiad.wmnet with OS bookworm [16:51:20] !log sbisson@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [16:51:45] (03CR) 10Ssingh: [C:03+1] "(Looks OK to me but please check as I haven't reviewed envoy stuff before to know if something else is missing)" [puppet] - 10https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [16:52:11] thanks for the early deploy urbanecm, sorry for the extra work 🙇‍♂️ [16:52:19] (03CR) 10Elukey: [V:03+2 C:03+2] Fix changelog warnings related to spark3.3 and jaeger [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100141 (owner: 10Elukey) [16:52:40] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: dc=magru,service=cdn,name=cp7001.magru.wmnet [16:53:46] chlod: no worries. 
feel free to ping for similar breakages, happy to deploy the revert before w/o waiting on the window :) [16:55:36] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10376638 (10SBisson) The diff produced: ` skipping missing values file matching "values-main.yaml" Comparing release=main, cha... [16:55:38] 06SRE, 06Infrastructure-Foundations, 06SRE Observability, 07Kubernetes: aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417 (10herron) 03NEW [16:55:56] am I okay to do some deploying on mw-related stuff now that the sync is done? [16:56:13] hnowlan: from my perspective, go ahead :) [16:56:35] 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10376653 (10herron) [16:56:40] (03CR) 10Ssingh: wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:56:51] chlod: do you know if the failure is reproducible on beta? if it is, that might be a good place to start from, but i do not know how many revisions was "too many" here. 
[16:56:52] (03CR) 10Ssingh: wdqs-internal: configure graphsplit load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:57:01] (03CR) 10Ssingh: [C:03+1] wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [16:58:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:58:04] (03CR) 10Hnowlan: [C:03+2] jobqueue: disable webVideoTranscodePrioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:58:24] (03CR) 10Dzahn: [C:03+2] aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:59:08] 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10376663 (10herron) 05Open→03Resolved VMs built, tracking remaining setup in {T381417} [16:59:08] (03Merged) 10jenkins-bot: jobqueue: disable webVideoTranscodePrioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:59:49] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10376668 (10herron) 05Open→03Resolved VMs built, tracking remaining setup in {T381417} [17:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:32] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376692 (10CDanis) I think we need an-worker* source port 50010, which I am pretty sure is just the dataplane of HDFS and not the metada... [17:00:59] (03CR) 10Hnowlan: [C:03+2] mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:01:10] 06SRE, 10vm-requests, 07Kubernetes: codfw: (4x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378987#10376676 (10herron) 05Open→03Resolved VMs built, tracking remaining setup in {T381417} [17:01:17] (03CR) 10Scott French: [C:03+1] gateway-check: Support (and use) per wiki rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [17:02:34] (03CR) 10Dzahn: [C:03+2] "tested push notifications, works as normal" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:02:48] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376688 (10CDanis) [17:05:10] (03Merged) 10jenkins-bot: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:07:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [17:07:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10376709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1091.eqiad.wmnet with OS bullseye [17:08:06] 
(03CR) 10Dzahn: [C:03+2] releases: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100146 (owner: 10Muehlenhoff) [17:08:10] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1034.eqiad.wmnet with reason: host reimage [17:08:23] urbanecm: trying to find out right now. i'm thinking that the sheer size of enwiki was just not going to make everything go smoothly, but this could also be a side effect of us switching to the revision table instead of the recentchanges one. [17:08:36] the plot thickens, i guess [17:08:53] chlod: if you want me to do some checks at enwiki with the revert now deployed, let me know [17:09:07] as long as you don't want me to actually delete pages there :)) [17:09:22] hehe [17:09:47] sam walton did some checks just now; it's been mitigated but some particularly broad checks such as searching for the title "%(video game)" still causes timeouts [17:10:03] (03CR) 10Dzahn: [C:03+2] ci: add WikimediaMessages to git cache [puppet] - 10https://gerrit.wikimedia.org/r/1099657 (https://phabricator.wikimedia.org/T374717) (owner: 10Hashar) [17:10:55] chlod: i see. progress, i guess. well, good luck, and let me know if i can help somehow – very happy to see this moving forward. 
[17:11:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1034.eqiad.wmnet with reason: host reimage [17:17:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [17:18:42] (03PS1) 10Herron: add aux-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100153 (https://phabricator.wikimedia.org/T381417) [17:18:45] (03PS2) 10Andrew Bogott: Puppet agent: allow hiera config of number_of_facts_soft_limit [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) [17:18:46] (03PS1) 10Andrew Bogott: Openstack nova: reduce overprivision ratio for disk and cpu [puppet] - 10https://gerrit.wikimedia.org/r/1100154 (https://phabricator.wikimedia.org/T380099) [17:19:14] (03PS2) 10Tchanders: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) [17:19:24] (03CR) 10Dzahn: [C:03+2] "I'm only merging this because it's already cherry-picked and only influences beta. Not really a review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [17:19:24] (03CR) 10Tchanders: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [17:19:31] (03PS8) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [17:19:31] (03PS6) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [17:20:06] (03CR) 10CI reject: [V:04-1] role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [17:20:40] (03CR) 10Ahmon Dancy: "Thanks Daniel. I'm going to work on this a bit more given Bryan's comments. I'll let you know when it's done." [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [17:20:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [17:21:08] (03CR) 10Volans: "Some questions inline, missing tests." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [17:21:34] (03PS9) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [17:21:34] (03PS7) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [17:21:53] (03PS2) 10Andrew Bogott: Openstack nova: reduce overprovision ratio for disk and cpu [puppet] - 10https://gerrit.wikimedia.org/r/1100154 (https://phabricator.wikimedia.org/T380099) [17:21:53] (03PS3) 10Andrew Bogott: Puppet agent: allow hiera config of number_of_facts_soft_limit [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) [17:24:01] (03PS1) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) [17:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376782 (10phaultfinder) [17:26:48] (03CR) 10Scott French: [C:03+1] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:27:36] (03CR) 10Hnowlan: [C:03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:29:14] jouncebot: nowandnext [17:29:15] For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1700) [17:29:15] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1800) [17:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10376795 (10phaultfinder) [17:30:08] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1034.eqiad.wmnet with OS bookworm [17:32:38] (03CR) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:32:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1035.eqiad.wmnet with OS bookworm [17:32:43] (03CR) 10Hnowlan: [C:03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:36:07] (03PS2) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) [17:36:22] (03CR) 10Andrew Bogott: [C:03+2] Openstack nova: reduce overprovision ratio for disk and cpu [puppet] - 10https://gerrit.wikimedia.org/r/1100154 (https://phabricator.wikimedia.org/T380099) (owner: 10Andrew Bogott) [17:37:08] (03PS2) 10Ebernhardson: cirrus: Configure MLR buckets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) [17:37:08] (03CR) 10Ebernhardson: cirrus: Configure MLR buckets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [17:37:08] (03PS1) 10Ebernhardson: cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) [17:38:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:38:44] (03PS10) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [17:38:44] (03PS8) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up 
SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [17:39:25] (03CR) 10Ebernhardson: cirrus: Configure MLR buckets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [17:39:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [17:40:15] (03CR) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:42:33] PROBLEM - MariaDB Replica SQL: s4 on db1245 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table wbc_entity_usage is corrupt: try to repair it on query. Default database: commonswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:43:52] (03PS2) 10Fabfur: cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [17:44:56] (03PS1) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100160 (https://phabricator.wikimedia.org/T371701) [17:46:50] !log Removing RSA certificate support from haproxy/cp (T370837) [17:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:53] T370837: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837 [17:47:38] (03CR) 10Hnowlan: [C:03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100160 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:47:46] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: dc=magru,service=cdn,name=cp7001.magru.wmnet [17:47:51] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:48:12] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Remove RSA deprecation warning page [puppet] - 10https://gerrit.wikimedia.org/r/1099791 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:48:22] (03Abandoned) 10Hnowlan: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100155 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:48:31] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1035.eqiad.wmnet with reason: host reimage [17:48:50] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2026:9100 - 
https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:49:19] (03Merged) 10jenkins-bot: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100160 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:49:27] FIRING: [23x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:34] FIRING: [4x] ProbeDown: Service wdqs2026:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:48] (03PS1) 10JHathaway: puppet 7: fix facter.conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) [17:50:05] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs2026.codfw.wmnet with reason: T376150 [17:50:08] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150 [17:50:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs2026.codfw.wmnet with reason: T376150 [17:52:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1035.eqiad.wmnet with reason: host reimage [17:52:25] (03PS11) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [17:52:25] (03PS9) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 [17:52:36] (03PS3) 10Fabfur: cache:haproxy: longer capture buffers for 
relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [17:53:13] (03PS1) 10Thcipriani: Add a banner for the 2024 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) [17:53:44] (03CR) 10CI reject: [V:04-1] Add a banner for the 2024 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) (owner: 10Thcipriani) [17:54:01] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10376882 (10cmooney) >>! In T381389#10376688, @CDanis wrote: > I think we need an-worker* source port 50010, which I am pretty sure is ju... [17:54:27] FIRING: [15x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:51] (03PS2) 10Thcipriani: Add a banner for the 2024 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) [17:54:59] (03PS3) 10Fabfur: hiera: enable haproxykafka on drmrs and magru [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) [17:55:20] (03CR) 10CI reject: [V:04-1] Add a banner for the 2024 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 
10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) (owner: 10Thcipriani) [17:56:30] (03CR) 10Thcipriani: [C:04-1] "Not deployable until survey launch (2024-12-05)" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) (owner: 10Thcipriani) [17:56:38] (03CR) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [17:56:46] (03CR) 10JHathaway: "Another option would be to block the larger facts, such as mountpoints, in the facter config if they are not used. What facts are causing " [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [17:56:51] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:57:05] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:57:12] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [17:57:32] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:57:35] (03PS4) 10Fabfur: cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [17:57:47] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:58:33] (03PS1) 10Hashar: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 [17:59:19] (03PS1) 10Hnowlan: Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100164 [17:59:27] FIRING: [15x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:02] (03PS1) 10Bking: wdqs-internal-main, wdqs-internal-scholarly: add discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1800) [18:00:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:00:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1091.eqiad.wmnet with OS bullseye [18:00:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10376913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1091.eqiad.wmnet with OS bullseye complete... 
[18:01:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10376916 (10Jclark-ctr) [18:01:13] (03CR) 10CI reject: [V:04-1] wdqs-internal-main, wdqs-internal-scholarly: add discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:01:25] (03CR) 10BryanDavis: [C:03+1] role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [18:01:45] (03CR) 10Ahmon Dancy: [C:03+1] "Daniel, this is ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [18:01:50] (03PS2) 10Bking: wdqs-internal-main, wdqs-internal-scholarly: add discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) [18:02:09] (03CR) 10Hnowlan: [C:03+2] Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100164 (owner: 10Hnowlan) [18:03:50] (03Merged) 10jenkins-bot: Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100164 (owner: 10Hnowlan) [18:05:23] (03CR) 10Ssingh: [C:03+1] wdqs-internal-main, wdqs-internal-scholarly: add discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:06:13] (03PS1) 10Cathal Mooney: New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) [18:06:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:07:09] (03PS5) 
10Fabfur: cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [18:09:04] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [18:11:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1035.eqiad.wmnet with OS bookworm [18:11:31] (03CR) 10Andrea Denisse: [V:04-1] "Waiting for Tyler Cipriani's approval." [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) (owner: 10Atieno) [18:15:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10376935 (10andrea.denisse) Hi @thcipriani do you approve this request? [18:16:13] (03PS2) 10Cathal Mooney: New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) [18:16:25] (03PS1) 10Ryan Kemper: wdqs-internal: remove dupl defs for wdqs20[18-20] [puppet] - 10https://gerrit.wikimedia.org/r/1100167 (https://phabricator.wikimedia.org/T376150) [18:16:48] (03PS2) 10Ryan Kemper: wdqs-internal: remove dupl defs for wdqs20[18-20] [puppet] - 10https://gerrit.wikimedia.org/r/1100167 (https://phabricator.wikimedia.org/T376150) [18:17:27] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [18:18:00] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100167 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:18:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10377028 
(10andrea.denisse) [18:18:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10377030 (10andrea.denisse) a:05Atieno→03thcipriani [18:18:25] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10377041 (10thcipriani) >>! In T381123#10370380, @elukey wrote: > @thcipriani Hi! Could you review this request? Lemme kno... [18:18:34] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10377042 (10andrea.denisse) a:03thcipriani [18:18:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10377045 (10thcipriani) >>! In T381408#10376934, @andrea.denisse wrote: > Hi @thcipriani do you approve this request? Yes, approved! [18:19:01] (03CR) 10Andrea Denisse: [C:03+2] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (https://phabricator.wikimedia.org/T381123) (owner: 10Arlolra) [18:19:05] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10377046 (10andrea.denisse) a:05thcipriani→03None [18:21:17] (03PS1) 10Andrea Denisse: Revert "Add myself to releasers-mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/1100168 [18:22:08] (03CR) 10Ssingh: "Seems to have signed it? 
https://phabricator.wikimedia.org/legalpad/signatures/3/query/99k87RGqyRwr/#R" [puppet] - 10https://gerrit.wikimedia.org/r/1100168 (owner: 10Andrea Denisse) [18:22:52] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10377069 (10andrea.denisse) Hi @elukey , you mentioned this was access request was good to merge after @thcipriani approva... [18:22:59] !log homer 'cr*eqiad*' commit 'T377876' [18:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:02] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [18:23:14] (03CR) 10Andrea Denisse: [C:03+2] Revert "Add myself to releasers-mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/1100168 (owner: 10Andrea Denisse) [18:23:43] (03CR) 10Dzahn: [C:03+1] "has approvals now" [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) (owner: 10Atieno) [18:24:22] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10377086 (10Jelto) [18:24:27] FIRING: [15x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10377100 (10phaultfinder) [18:26:02] (03PS1) 10Andrea Denisse: Revert^2 "Add myself to releasers-mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/1100170 [18:26:08] (03CR) 10Andrea Denisse: [C:03+2] Revert^2 "Add myself to releasers-mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/1100170 (owner: 10Andrea Denisse) [18:26:09] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert^2 
"Add myself to releasers-mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/1100170 (owner: 10Andrea Denisse) [18:27:15] (03CR) 10Bking: [C:03+1] wdqs-internal: remove dupl defs for wdqs20[18-20] [puppet] - 10https://gerrit.wikimedia.org/r/1100167 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:28:05] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: remove dupl defs for wdqs20[18-20] [puppet] - 10https://gerrit.wikimedia.org/r/1100167 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:28:27] (03CR) 10Ssingh: "Nice work!" [dns] - 10https://gerrit.wikimedia.org/r/1097521 (owner: 10CDobbins) [18:29:56] (03CR) 10DCausse: [C:04-1] cirrus: Configure MLR buckets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [18:30:19] (03PS1) 10Ryan Kemper: wdqs-internal: add graph split hosts to scap [puppet] - 10https://gerrit.wikimedia.org/r/1100171 (https://phabricator.wikimedia.org/T376150) [18:30:20] (03PS1) 10Jelto: Revert "trafficserver: switch query-scholarly to wikikube" [puppet] - 10https://gerrit.wikimedia.org/r/1100172 (https://phabricator.wikimedia.org/T350793) [18:30:50] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10377117 (10Dzahn) Since we could not test if the service starts on list2001 (fails because it can't talk to the DB), I created a test instance... 
[18:30:55] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10377122 (10phaultfinder) [18:31:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10377123 (10andrea.denisse) [18:31:12] (03CR) 10Jelto: [C:03+2] Revert "trafficserver: switch query-scholarly to wikikube" [puppet] - 10https://gerrit.wikimedia.org/r/1100172 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [18:31:26] (03CR) 10Bking: [C:03+1] wdqs-internal: add graph split hosts to scap [puppet] - 10https://gerrit.wikimedia.org/r/1100171 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:31:36] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: add graph split hosts to scap [puppet] - 10https://gerrit.wikimedia.org/r/1100171 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:31:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:15] (03CR) 10Andrea Denisse: [V:04-1 C:03+2] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) (owner: 10Atieno) [18:33:53] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10377149 (10KFrancis) Thanks! The NDA is out for signatures. I'll confirm when it's complete. 
[18:34:27] FIRING: [14x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1034-1035].eqiad.wmnet [18:35:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1034-1035].eqiad.wmnet [18:35:47] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10377112 (10andrea.denisse) 05Open→03Resolved a:03andrea.denisse Marking the task as resolved, feel free to reach out if there's anyth... [18:36:45] (03PS4) 10Ryan Kemper: wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) [18:36:45] (03PS3) 10Ryan Kemper: wdqs-internal-main, wdqs-internal-scholarly: add discovery DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:37:11] (03PS3) 10Ebernhardson: cirrus: Configure MLR buckets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) [18:37:11] (03CR) 10Ebernhardson: cirrus: Configure MLR buckets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [18:37:11] (03PS2) 10Ebernhardson: cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) [18:37:32] (03CR) 10Ebernhardson: cirrus: Configure MLR buckets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) 
(owner: 10Ebernhardson) [18:38:27] (03PS3) 10Ebernhardson: cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) [18:38:32] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10377155 (10SBisson) 05Open→03Resolved I guess the results of the diff and sync commands above confirm that I do have... [18:39:10] (03PS4) 10Ryan Kemper: wdqs-internal: add graph split disc DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:39:16] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host [18:39:27] FIRING: [14x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:28] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host (duration: 00m 11s) [18:39:35] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host [18:39:47] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host (duration: 00m 11s) [18:40:11] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [18:41:23] (03CR) 10Bking: [C:03+1] wdqs-internal: add graph split disc DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:43:28] (03PS3) 10Andrea Denisse: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1100149 (https://phabricator.wikimedia.org/T381408) (owner: 10Atieno) 
[18:43:43] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 03m 31s) [18:44:27] FIRING: [7x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:07] (03PS1) 10Ryan Kemper: wdqs-internal: setup 2 eqiad hosts for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100173 (https://phabricator.wikimedia.org/T376150) [18:46:40] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100173 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:47:31] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [18:47:46] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 00m 14s) [18:48:33] (03PS2) 10Ryan Kemper: wdqs-internal: setup 2 eqiad hosts for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100173 (https://phabricator.wikimedia.org/T376150) [18:49:01] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [18:49:15] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 00m 14s) [18:49:27] FIRING: [7x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:40] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10377205 (10andrea.denisse) 05Open→03Stalled [18:50:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for 
Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10377208 (10andrea.denisse) 05Open→03Stalled [18:51:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for Atieno - https://phabricator.wikimedia.org/T381408#10377187 (10andrea.denisse) 05Open→03Resolved Marking the task as resolved, feel free to reach out if there's anything else we can assist with. [18:51:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:52:10] (03CR) 10Bking: [C:03+1] wdqs-internal: setup 2 eqiad hosts for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100173 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:52:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10377210 (10andrea.denisse) 05Open→03Stalled a:03thcipriani [18:52:40] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: setup 2 eqiad hosts for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100173 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:52:51] RESOLVED: PuppetFailure: Puppet has failed on wdqs2027:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:53:26] (03PS4) 10Scott French: mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) [18:53:26] (03PS4) 10Scott French: mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 
(https://phabricator.wikimedia.org/T377040) [18:53:27] (03PS4) 10Scott French: mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) [18:53:27] (03PS4) 10Scott French: mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) [18:53:27] (03PS4) 10Scott French: mediawiki: remove migration release overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040) [18:54:27] FIRING: [7x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:30] FIRING: [4x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:55:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2027:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:56:42] (03CR) 10Dzahn: [C:03+2] "yea, this is nicer than hardcoding the 2 user names:)" [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [18:56:53] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) 
xfer scholarly_articles from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet, repooling source-only afterwards [18:56:53] ^ I am assuming these alerts are known and related [18:56:56] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [18:57:52] sukhe indeed, we will suppress those [18:57:56] thanks [18:58:03] (03CR) 10DCausse: [C:03+1] cirrus: Configure MLR buckets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [18:58:05] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs2027.codfw.wmnet with reason: T376150 [18:58:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs2027.codfw.wmnet with reason: T376150 [18:58:14] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [18:58:15] inflatador: no worries about suppressing, was just making sure it's all OK (since on on-call :) [18:58:59] (03CR) 10DCausse: [C:03+1] cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [18:59:02] (03PS1) 10Ryan Kemper: wdqs-internal: add hiera for eqiad graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100175 (https://phabricator.wikimedia.org/T379329) [18:59:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal main tier) xfer scholarly_articles from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet, repooling source-only afterwards [18:59:50] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10377262 (10thcipriani) 05Stalled→03Open a:05thcipriani→03andrea.denisse >>! 
In T380525#10357449, @elukey wrote: > @thcipriani Hi! I'd need... [19:00:05] thcipriani and thcipriani: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T1900). [19:00:41] oh, that's me [19:00:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10377271 (10sbassett) Will the analytics-privatedata-users access request include setting up [[ https://wikitech.wikimedia.org/wiki/Data_Platform/S... [19:00:50] ah, shoot [19:00:55] * thcipriani fixes calendar [19:00:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:58] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:01:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:22] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:02:25] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [19:02:56] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100176 
(https://phabricator.wikimedia.org/T375665) [19:02:58] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100176 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [19:03:50] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100176 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [19:04:18] (03CR) 10Bking: [C:03+1] wdqs-internal: add hiera for eqiad graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100175 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [19:04:35] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1278-1279].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [19:04:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:05:07] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: add hiera for eqiad graph split [puppet] - 10https://gerrit.wikimedia.org/r/1100175 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [19:05:51] (03CR) 10Dzahn: [C:03+2] "I assume you are going to replace/remove the previously cherry-picked one now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [19:06:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [19:07:16] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10377308 (10herron) [19:09:07] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [19:10:46] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1278.eqiad.wmnet with OS bookworm [19:11:53] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 02m 45s) [19:11:55] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [19:11:57] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:12:00] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [19:13:05] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 01m 09s) [19:13:13] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host [19:13:21] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host (duration: 00m 07s) [19:14:20] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, 
AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:14:27] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:30] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:15:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [19:15:37] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10377327 (10herron) > I am writing this email in regard to case #3622388, which is about "escalation not working as expected". > > As mentioned in the previous email... 
[19:15:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:15:40] !log rebooting rpki2003 to clear out tmpfs filesystem which is full [19:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:00] RECOVERY - Routinator process on rpki2003 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [19:16:38] RECOVERY - RPKI Validator RTR port on rpki2003 is OK: TCP OK - 0.031 second response time on 10.192.24.3 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [19:17:57] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.6 refs T375665 [19:18:02] (03PS2) 10JHathaway: puppet 7: fix facter.conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) [19:18:04] T375665: 1.44.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T375665 [19:18:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [19:19:04] RECOVERY - Disk space on rpki2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=rpki2003&var-datasource=codfw+prometheus/ops [19:19:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [19:20:55] RESOLVED: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:30] FIRING: [3x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:54] (03PS3) 10JHathaway: puppet 7: fix facter.conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) [19:26:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:27:01] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [19:29:30] FIRING: [4x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:19] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1278.eqiad.wmnet with reason: host reimage [19:31:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1278.eqiad.wmnet with reason: host reimage [19:44:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:49:57] 
(03PS1) 10Ssingh: wikimedia.org: test if CI picks up duplicate IP [dns] - 10https://gerrit.wikimedia.org/r/1100187 [19:52:48] (03PS2) 10Ssingh: wikimedia.org: test if CI picks up duplicate IP [dns] - 10https://gerrit.wikimedia.org/r/1100187 [19:53:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1278.eqiad.wmnet with OS bookworm [19:54:10] (03CR) 10CI reject: [V:04-1] wikimedia.org: test if CI picks up duplicate IP [dns] - 10https://gerrit.wikimedia.org/r/1100187 (owner: 10Ssingh) [19:54:21] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:54:50] (03Abandoned) 10Ssingh: wikimedia.org: test if CI picks up duplicate IP [dns] - 10https://gerrit.wikimedia.org/r/1100187 (owner: 10Ssingh) [19:55:28] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1279.eqiad.wmnet with OS bookworm [19:55:53] PROBLEM - MariaDB Replica Lag: s1 on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 480.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:57:24] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:57:27] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [19:59:21] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:59:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [19:59:36] hello [19:59:39] !incidents [19:59:39] 5505 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [19:59:39] 5504 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [19:59:40] 5503 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [19:59:40] 5502 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [19:59:40] 5501 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [19:59:43] !ack 5505 [19:59:44] 5505 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [19:59:53] is this a repeat of yesterday's I wonder [20:00:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:51] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:01:52] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:03:30] hmm [20:03:38] I wonder if this can be related to the above [20:04:03] wdqs1021 -> wdqs1026 [20:04:12] (03CR) 10Ahmon Dancy: [C:03+1] "Confirmed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [20:04:27] FIRING: [7x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:30] FIRING: [8x] ProbeDown: Service wdqs1023:443 has failed probes (http_wdqs_scholarly_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:39] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:04:42] resolved too [20:04:44] sukhe: that seems plausible row-wise, yeah [20:05:26] https://librenms.wikimedia.org/device/device=161/tab=port/port=35736/ [20:08:12] actually more accurately [20:09:34] not sure how to read the cookbook invocation without diving into it but also https://librenms.wikimedia.org/graphs/to=1733256300/id=31428/type=port_bits/from=1733169900/ this is wdqs1023 [20:10:15] anyway the timing matches and it resolved so I am going to attribute it to that unless it happens again [20:10:25] swfrench-wmf: thanks for the rubber duck [20:10:40] what's needed from my team, does this mean we need to add some throttling to the transfer rate? [20:10:53] I haven't seen this fire before but I gather it could be just due to the specifics of the row [20:10:57] possibly but I have seen you run this before I think and not alerted? [20:11:01] yeah [20:11:28] was there something new that was being done this time? as in, was there something different about the transfer? 
[20:15:56] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1279.eqiad.wmnet with reason: host reimage [20:19:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1279.eqiad.wmnet with reason: host reimage [20:22:01] sukhe: nothing different [20:22:45] (03PS6) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [20:22:46] (03CR) 10Kamila Součková: "Not ready for a detailed review yet, but would like to know if the high-level approach roughly makes sense." [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [20:29:27] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:21] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:34:21] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:34:27] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:21] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:38:34] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1279.eqiad.wmnet with OS bookworm [20:38:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1278-1279].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [20:42:44] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:46:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [20:46:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [20:46:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:48] FIRING: PuppetFailure: Puppet has failed on wdqs1027:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:48:30] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1089 [20:48:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1089 [20:48:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1090 [20:48:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1090 [20:48:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1088 [20:49:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1088 [20:49:10] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1087 [20:49:54] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1087 [20:49:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377654 (10Jclark-ctr) [20:50:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10377655 (10andrea.denisse) [20:53:53] RECOVERY - MariaDB Replica Lag: s1 on db1206 is OK: OK slave_sql_lag Replication lag: 41.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:54:27] (03PS1) 10BCornwall: haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) [20:54:29] (03PS1) 10BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) [20:56:01] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4624/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:56:09] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4625/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:56:40] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10377666 (10Andrew) a:05Andrew→03ssingh I've checked all the resolv.confs and they all look fine. I'm passing this task over to... 
[20:58:31] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] admin: add Jimmy Ly's account [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) (owner: 10Elukey) [20:59:42] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:51] (03CR) 10Kamila Součková: "(Note to self: to support renames I just need to add `--new`, for renumbering I need to instead call the renumber-node cookbook)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241203T2100). [21:00:05] jan_drewniak, ebernhardson, and srishakatux: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:36] Hello, we need help deploying if anyone is available. (I'm doing the testing in place of jan_drewniak) The order in the table is the order we want. 
Thank you [21:02:12] \o [21:02:28] FIRING: SystemdUnitCrashLoop: wdqs-updater.service crashloop on wdqs1027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:04:27] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:35] i suppose i can do the deploy [21:05:48] kimberly_sarabia: they need to be deployed 1 at a time? [21:06:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10377682 (10andrea.denisse) 05Open→03Resolved Hi @sbassett, a separate task is required to set-up the Kerberos credentials. Please follow t... [21:06:43] ebernhardson: Thanks! Actually, I briefly was told yeah they could be run at the same time. [21:06:55] So yes [21:07:17] kimberly_sarabia: so merge them as one deploy for all 3? [21:07:28] !log dancy@deploy2002 Installing scap version "4.132.0" for 1 host(s) [21:08:19] !log dancy@deploy2002 Installation of scap version "4.132.0" completed for 1 hosts [21:08:32] ebernhardson: Yup! 
[21:08:42] (03PS1) 10Andrew Bogott: Initial (insetup) puppet for cloudcontrol1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1100195 (https://phabricator.wikimedia.org/T380499) [21:09:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099790 (https://phabricator.wikimedia.org/T380778) (owner: 10Jdrewniak) [21:09:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [21:09:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100138 (https://phabricator.wikimedia.org/T381041) (owner: 10LorenMora) [21:09:47] (03Merged) 10jenkins-bot: Rerunning Web browser extension survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099790 (https://phabricator.wikimedia.org/T380778) (owner: 10Jdrewniak) [21:09:52] (03Merged) 10jenkins-bot: Reenable non-UI experiment quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [21:09:54] (03Merged) 10jenkins-bot: Deploy Vector22 To Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100138 (https://phabricator.wikimedia.org/T381041) (owner: 10LorenMora) [21:10:00] alright things are kicked off, they will be ready for mwdebug in a few [21:10:14] Thanks! 
[21:10:22] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1099790|Rerunning Web browser extension survey (T380778)]], [[gerrit:1091749|Reenable non-UI experiment quick survey (T379241)]], [[gerrit:1100138|Deploy Vector22 To Wikis (T381041)]] [21:10:32] T380778: Simple summary experiment - Rerun QuickSurvey for browser extension - https://phabricator.wikimedia.org/T380778 [21:10:32] T379241: Set up quicksurveys for non-UI experiment pt 2 - https://phabricator.wikimedia.org/T379241 [21:10:33] T381041: Dec 3: Vector 2022 Deployments - https://phabricator.wikimedia.org/T381041 [21:11:52] (03CR) 10Andrew Bogott: [C:03+2] Initial (insetup) puppet for cloudcontrol1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1100195 (https://phabricator.wikimedia.org/T380499) (owner: 10Andrew Bogott) [21:12:28] RESOLVED: SystemdUnitCrashLoop: wdqs-updater.service crashloop on wdqs1027:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:12:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10377712 (10Andrew) a:05Andrew→03None [21:16:43] !log ebernhardson@deploy2002 bwang, ebernhardson, lmora, jdrewniak: Backport for [[gerrit:1099790|Rerunning Web browser extension survey (T380778)]], [[gerrit:1091749|Reenable non-UI experiment quick survey (T379241)]], [[gerrit:1100138|Deploy Vector22 To Wikis (T381041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:16:48] T380778: Simple summary experiment - Rerun QuickSurvey for browser extension - https://phabricator.wikimedia.org/T380778 [21:16:48] T379241: Set up quicksurveys for non-UI experiment pt 2 - https://phabricator.wikimedia.org/T379241 [21:16:49] T381041: Dec 3: Vector 2022 Deployments - https://phabricator.wikimedia.org/T381041 [21:21:43] (03PS2) 10BCornwall: 
haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) [21:21:43] (03PS2) 10BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) [21:22:52] kimberly_sarabia: all up on mwdebug [21:23:10] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:23:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4626/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [21:23:29] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4627/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [21:24:12] ebernhardson: Thanks. Everything looks good! 
[21:24:20] ok, shipping it [21:24:24] !log ebernhardson@deploy2002 bwang, ebernhardson, lmora, jdrewniak: Continuing with sync [21:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10377734 (10phaultfinder) [21:28:23] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Backfill allocations for mw-parsoid LVS VIPs - swfrench@cumin2002" [21:28:28] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Backfill allocations for mw-parsoid LVS VIPs - swfrench@cumin2002" [21:28:28] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:29:27] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10377760 (10phaultfinder) [21:32:22] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099790|Rerunning Web browser extension survey (T380778)]], [[gerrit:1091749|Reenable non-UI experiment quick survey (T379241)]], [[gerrit:1100138|Deploy Vector22 To Wikis (T381041)]] (duration: 22m 00s) [21:32:27] T380778: Simple summary experiment - Rerun QuickSurvey for browser extension - https://phabricator.wikimedia.org/T380778 [21:32:28] T379241: Set up quicksurveys for non-UI experiment pt 2 - https://phabricator.wikimedia.org/T379241 [21:32:28] T381041: Dec 3: Vector 2022 Deployments - https://phabricator.wikimedia.org/T381041 [21:33:08] kimberly_sarabia: alright, fully shipped [21:33:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [21:33:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [21:33:31] ebernhardson: Thanks! [21:33:37] np [21:34:07] (03Merged) 10jenkins-bot: cirrus: Configure MLR buckets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [21:34:14] Here. [21:34:21] !incidents [21:34:21] 5506 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [21:34:22] 5505 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [21:34:22] 5504 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [21:34:22] 5503 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [21:34:27] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:37] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1099773|cirrus: Configure MLR buckets (T377128)]] [21:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10377788 (10phaultfinder) [21:34:39] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:35:14] denisse: here as well - this may be the same issue s.ukhe encountered earlier [21:36:36] 
once again, it's xe-2/0/45 toward cr1-eqiad [21:38:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [21:39:07] swfrench-wmf: yes, it looks like it's the same issue. [21:39:19] here, need more hands? [21:40:31] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1099773|cirrus: Configure MLR buckets (T377128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:40:34] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:43:21] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm - Init7, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:45:21] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:45:37] (03Abandoned) 10Gergő Tisza: SUL3: Set $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099697 (https://phabricator.wikimedia.org/T377142) (owner: 10Gergő Tisza) [21:45:44] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [21:46:36] rzl: thanks! unless you have any brilliant ideas for tracking down the destination of this traffic flow, I'm not sure there's much we can do for the moment (it's also self resolved, though the port is definitely running hot since 19:50 or so). 
[21:49:31] yeah agreed, just dancing back and forth across the alert threshold [21:49:39] no inspiration on the cause though [21:49:59] ah, cr1-eqiad:xe-3/2/3 <> asw2-b-eqiad:xe-2/0/45 just fell back to baseline [21:50:14] i.e., what ever this was, it appears to be done for now [21:50:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:52:25] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099773|cirrus: Configure MLR buckets (T377128)]] (duration: 17m 47s) [21:52:28] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:54:27] FIRING: [10x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:33] (03PS1) 10CDanis: chart-renderer: scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100202 (https://phabricator.wikimedia.org/T379687) [21:54:34] FIRING: [8x] ProbeDown: Service wdqs1027:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:55:46] swfrench-wmf: rzl: do we know the source host? [21:56:01] or if it is internal or external [21:57:02] cdanis: no, my librenms is still pretty weak [21:57:29] eheh yeah [21:58:29] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100203 [21:59:42] FIRING: [10x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:46] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446 (10BCornwall) 03NEW [21:59:48] (03CR) 10Krinkle: [C:03+1] webperf: set statsd exporter timer type to histogram (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [22:00:12] (03CR) 10Krinkle: [C:03+1] prometheus: restart statsd-exporter on config change [puppet] - 10https://gerrit.wikimedia.org/r/1099822 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [22:01:45] (03PS2) 10Cwhite: webperf: disable statsd-exporter relaying flag [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) [22:02:08] (03CR) 10Krinkle: "boldly rebasing on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1099720 as I think we have to point statsv.py to statsd before dis" [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [22:02:28] (03PS2) 10Cwhite: prometheus: restart statsd-exporter on config change [puppet] - 10https://gerrit.wikimedia.org/r/1099822 (https://phabricator.wikimedia.org/T355837) [22:02:42] (03PS4) 10Krinkle: webperf: set statsv.py --statsd to 
statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) [22:02:46] (03CR) 10Krinkle: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [22:02:53] (03PS3) 10Cwhite: webperf: disable statsd-exporter relaying flag [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) [22:03:05] (03PS2) 10Cwhite: webperf: set statsd exporter timer type to histogram [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) [22:04:27] FIRING: [10x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:07:19] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10377857 (10BCornwall) [22:07:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:08:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:09:47] !incidents [22:09:48] 5507 (ACKED) Primary outbound port 
utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:09:48] 5508 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [22:09:48] 5506 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [22:09:48] 5505 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [22:09:48] 5504 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org) [22:09:49] 5503 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [22:10:07] !log brett@cumin2002 START - Cookbook sre.dns.netbox [22:12:24] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:13:17] (03CR) 10Jdlrobson: [C:03+1] chart-renderer: scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100202 (https://phabricator.wikimedia.org/T379687) (owner: 10CDanis) [22:15:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:15:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:15:23] <_Gerges> Hi, Does the T381445 task need community consensus? 
[22:15:24] T381445: Add "Noto Sans Arabic" Font - https://phabricator.wikimedia.org/T381445 [22:17:04] (03CR) 10BCornwall: [V:03+1 C:03+2] "Thanks for pointing that out; I've created I171b93157479ec4d3decc5e9487e1bf2f70714fe" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [22:17:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377873 (10VRiley-WMF) [22:21:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1087.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1089.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1088.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1090.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1089.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:23:40] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1083.eqiad.wmnet with OS bullseye [22:23:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377879 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-be1083.eqiad.wmnet with OS bullseye [22:27:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377881 (10VRiley-WMF) [22:28:15] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly 
host [22:29:27] FIRING: [10x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:21] !log dancy@deploy2002 Installing scap version "4.132.0" for 1 host(s) [22:31:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1087.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:31:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1088.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:31:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1090.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:32:12] !log dancy@deploy2002 Installation of scap version "4.132.0" completed for 1 hosts [22:32:14] !log ryankemper@deploy2002 deploy aborted: deploy to fresh wdqs-internal-scholarly host (duration: 03m 59s) [22:32:18] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host [22:32:31] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-scholarly host (duration: 00m 13s) [22:32:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1087.eqiad.wmnet with OS bullseye [22:32:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1087.eqiad.wmnet with OS bullseye [22:32:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1090.eqiad.wmnet with OS bullseye [22:32:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet 
with OS bullseye [22:32:59] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1090.eqiad.wmnet with OS bullseye [22:33:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1088.eqiad.wmnet with OS bullseye [22:34:01] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage [22:34:27] FIRING: [10x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs[1026-1027].eqiad.wmnet with reason: T376150 [22:35:27] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [22:35:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs[1026-1027].eqiad.wmnet with reason: T376150 [22:37:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage [22:38:19] (03PS1) 10Ahmon Dancy: bootstrap-scap-target.sh: Temp hard code scap version [puppet] - 10https://gerrit.wikimedia.org/r/1100204 (https://phabricator.wikimedia.org/T380772) [22:38:33] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host [22:38:47] !log ryankemper@deploy2002 Finished deploy 
[wdqs/wdqs@9927a5a]: deploy to fresh wdqs-internal-main host (duration: 00m 13s) [22:40:05] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377943 (10VRiley-WMF) [22:42:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1089.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:43:09] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1087.eqiad.wmnet with reason: host reimage [22:43:32] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [22:43:33] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [22:43:36] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [22:43:37] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1090.eqiad.wmnet with reason: host reimage [22:44:13] (03CR) 10Ahmon Dancy: "ryankemper said that was able to work around the https://phabricator.wikimedia.org/P71504 problem, so this is not urgent anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/1100204 (https://phabricator.wikimedia.org/T380772) (owner: 10Ahmon Dancy) [22:46:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1087.eqiad.wmnet with reason: host reimage [22:50:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [22:52:21] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:52:32] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:52:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1089.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:52:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1090.eqiad.wmnet with reason: host reimage [22:53:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1089.eqiad.wmnet with OS bullseye [22:53:42] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1089.eqiad.wmnet with OS bullseye [22:53:49] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:54:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377961 (10Jclark-ctr) [22:57:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:57:38] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:59:30] FIRING: [6x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:01:40] ^^ expected...our setup cook-books remove downtimes when they're finished, but these hosts aren't quite ready yet ;) [23:01:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[1026-1027].eqiad.wmnet with reason: T376150 [23:02:00] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [23:02:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[1026-1027].eqiad.wmnet with reason: T376150 [23:02:26] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 [23:02:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 [23:03:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [23:04:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1084.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:04:21] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1089.eqiad.wmnet with reason: host reimage [23:08:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:08:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1089.eqiad.wmnet with reason: host reimage [23:08:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:08:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1088.eqiad.wmnet with OS bullseye [23:08:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1088.eqiad.wmnet with OS bullseye complete... 
[23:11:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:11:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:11:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1090.eqiad.wmnet with OS bullseye [23:11:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10377988 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1090.eqiad.wmnet with OS bullseye complete... [23:11:51] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:12:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:12:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1087.eqiad.wmnet with OS bullseye [23:12:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1087.eqiad.wmnet with OS bullseye complete... 
[23:12:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378001 (10Jclark-ctr) [23:14:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10378004 (10phaultfinder) [23:16:05] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378005 (10VRiley-WMF) [23:19:04] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:19:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:19:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1083.eqiad.wmnet with OS bullseye [23:19:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-be1083.eqiad.wmnet with OS bullseye complete... 
[23:20:03] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1084.eqiad.wmnet with OS bullseye [23:20:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-be1084.eqiad.wmnet with OS bullseye [23:21:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378012 (10Jclark-ctr) [23:22:32] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [23:22:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [23:22:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:25:52] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:27:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:27:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1089.eqiad.wmnet with OS bullseye [23:27:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378015 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1089.eqiad.wmnet with OS bullseye complete... 
[23:28:34] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1086 [23:29:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1086 [23:30:33] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage [23:34:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage [23:34:26] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:34:30] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:42] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:34:44] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:44] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:50] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:35:52] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [23:36:05] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [23:36:08] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - 
https://phabricator.wikimedia.org/T376150 [23:36:35] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [23:37:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:39:37] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1085 - vriley@cumin1002" [23:39:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1085 - vriley@cumin1002" [23:39:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:40:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378038 (10VRiley-WMF) [23:40:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1041.eqiad.wmnet with OS bookworm [23:40:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm [23:41:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1042.eqiad.wmnet with OS bookworm [23:41:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [23:41:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378040 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1042.eqiad.wmnet with OS bookworm [23:41:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm [23:42:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: wdqs1025 fails to PXE boot, NIC shows "no link" in DRAC web UI - https://phabricator.wikimedia.org/T381283#10378033 (10Jclark-ctr) 05Open→03Resolved @bking replaced cable link came up sorry for delay [23:42:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:48:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1044.eqiad.wmnet with OS bookworm [23:48:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm [23:48:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:49:59] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:52:22] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [23:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10378045 (10phaultfinder) [23:58:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/blunderbuss: apply