[00:05:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1057975 (owner: 10TrainBranchBot) [00:17:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T367856)', diff saved to https://phabricator.wikimedia.org/P67008 and previous config saved to /var/cache/conftool/dbconfig/20240730-001710-marostegui.json [00:17:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:19:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:59] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10025868 (10Dwisehaupt) Updated frqueue2003 mgmt dns name in netbox from `frqueue` to `frqueue2003`. Pushed out the dns change using the runbook. Updated bastion iptables rules f... 
[00:22:03] FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [00:24:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:32:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P67009 and previous config saved to /var/cache/conftool/dbconfig/20240730-003218-marostegui.json [00:47:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P67010 and previous config saved to /var/cache/conftool/dbconfig/20240730-004725-marostegui.json [01:02:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T367856)', diff saved to https://phabricator.wikimedia.org/P67011 and previous config saved to /var/cache/conftool/dbconfig/20240730-010232-marostegui.json [01:02:50] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:08:18] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1057987 (https://phabricator.wikimedia.org/T366961) [01:08:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1057987 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [01:09:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
- https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:25:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:19] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1057987 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [01:50:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [01:59:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, 
skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0200) [02:04:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:19:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:50:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0300) [03:00:39] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:25] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057999 (https://phabricator.wikimedia.org/T366961) [03:01:27] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057999 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [03:02:06] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057999 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [03:02:23] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.16 refs T366961 [03:02:28] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [03:04:21] FIRING: [2x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:39:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:40:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0400) 
[04:07:03] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.13 (duration: 06m 51s) [04:08:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026111 (10Marostegui) @volans any idea why this may be happening during this reimage? I don't see anything different within its puppet definition that could explain why. [04:15:25] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:03] FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [04:23:24] (03PS1) 10Marostegui: Revert "db2179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1058004 [04:23:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67013 and previous config saved to /var/cache/conftool/dbconfig/20240730-042358-root.json [04:24:01] (03CR) 10Marostegui: [C:03+2] Revert "db2179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1058004 (owner: 10Marostegui) [04:25:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371251 [04:25:19] T371251: Switchover s4 master (db1238 -> db1160) - https://phabricator.wikimedia.org/T371251 [04:25:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1160 with weight 0 T371251', diff saved to https://phabricator.wikimedia.org/P67014 and previous config saved to /var/cache/conftool/dbconfig/20240730-042528-root.json 
[04:25:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371251 [04:26:18] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1057866 (https://phabricator.wikimedia.org/T371251) (owner: 10Gerrit maintenance bot) [04:27:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [04:27:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [04:27:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T367856)', diff saved to https://phabricator.wikimedia.org/P67015 and previous config saved to /var/cache/conftool/dbconfig/20240730-042755-marostegui.json [04:28:03] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:29:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:41] (03PS1) 10Marostegui: db1238,db1244: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058005 [04:31:49] (03CR) 10Marostegui: [C:03+2] db1238,db1244: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058005 (owner: 10Marostegui) [04:34:10] (03PS1) 10Marostegui: installserver: Do not reimage pc2017 [puppet] - 10https://gerrit.wikimedia.org/r/1058006 [04:37:11] 
(03CR) 10Marostegui: [C:03+2] installserver: Do not reimage pc2017 [puppet] - 10https://gerrit.wikimedia.org/r/1058006 (owner: 10Marostegui) [04:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67016 and previous config saved to /var/cache/conftool/dbconfig/20240730-043904-root.json [04:45:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:49:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:50:06] !log Starting s4 eqiad failover from db1238 to db1160 - T371251 [04:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:10] T371251: Switchover s4 master (db1238 -> db1160) - https://phabricator.wikimedia.org/T371251 [04:50:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T371251', diff saved to https://phabricator.wikimedia.org/P67017 and previous config saved to /var/cache/conftool/dbconfig/20240730-045032-root.json [04:51:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T371251', diff saved to https://phabricator.wikimedia.org/P67018 and previous config saved to /var/cache/conftool/dbconfig/20240730-045104-marostegui.json [04:51:34] (03CR) 10Marostegui: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1057867 (https://phabricator.wikimedia.org/T371251) (owner: 10Gerrit maintenance bot) [04:53:37] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1238 T371251', diff saved to https://phabricator.wikimedia.org/P67019 and previous config saved to /var/cache/conftool/dbconfig/20240730-045336-marostegui.json [04:54:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67020 and previous config saved to /var/cache/conftool/dbconfig/20240730-045409-root.json [04:55:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Long schema change [04:55:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Long schema change [04:55:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Recloning db1238 [04:55:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Recloning db1238 [05:04:36] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1244.eqiad.wmnet onto db1238.eqiad.wmnet [05:09:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67021 and previous config saved to /var/cache/conftool/dbconfig/20240730-050914-root.json [05:12:42] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342 (10Marostegui) 03NEW [05:12:46] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10026169 (10Marostegui) p:05Triage→03High [05:15:56] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10026182 (10Marostegui) >>! 
In T370852#10010096, @Ladsgroup wrote: > This should have the map: https://fault-tolerance.toolforge.org/map?cluster=s1 Jus... [05:18:42] (03PS1) 10Marostegui: db1244: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058011 (https://phabricator.wikimedia.org/T371343) [05:19:12] (03CR) 10Marostegui: [C:03+2] db1244: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058011 (https://phabricator.wikimedia.org/T371343) (owner: 10Marostegui) [05:19:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:21] !log Change candidate master in s4 eqiad (this is a NOOP) T371343 [05:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:26] T371343: Prepare new candidate master for s4 - https://phabricator.wikimedia.org/T371343 [05:20:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67022 and previous config saved to /var/cache/conftool/dbconfig/20240730-052420-root.json [05:34:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0600) [06:00:05] marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0600). nyaa~ [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:29] (03CR) 10Slyngshede: [C:03+2] data.yaml: Extend andyrussg until the end of August. [puppet] - 10https://gerrit.wikimedia.org/r/1057877 (owner: 10Slyngshede) [06:18:43] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10026230 (10ayounsi) We can potentially re-use the `move_server.MoveServer` script but make the server selection a `MultiObjectVar` as input and make the rack U... 
[06:24:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:29:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:40:07] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1058025 (https://phabricator.wikimedia.org/T371345) [06:40:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T371345 [06:41:19] T371345: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T371345 [06:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T371345', diff saved to https://phabricator.wikimedia.org/P67023 and previous config saved to /var/cache/conftool/dbconfig/20240730-064128-marostegui.json [06:41:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T371345 [06:43:18] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1058025 (https://phabricator.wikimedia.org/T371345) (owner: 10Gerrit maintenance bot) [06:44:21] FIRING: [2x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:32] (03PS1) 10Fabfur: Revert "geo-maps: make esams default DC for France" [dns] - 10https://gerrit.wikimedia.org/r/1058026 [06:48:15] (03PS2) 10Fabfur: Revert "geo-maps: make esams default DC for France" [dns] - 10https://gerrit.wikimedia.org/r/1058026 [06:48:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2212', diff saved to https://phabricator.wikimedia.org/P67024 and previous config saved to /var/cache/conftool/dbconfig/20240730-064835-root.json [06:48:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2216', diff saved to https://phabricator.wikimedia.org/P67025 and previous config saved to /var/cache/conftool/dbconfig/20240730-064853-root.json [06:56:47] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2216.codfw.wmnet onto db2212.codfw.wmnet [06:58:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1244.eqiad.wmnet onto db1238.eqiad.wmnet [06:58:56] (03CR) 10Ayounsi: [C:03+1] Revert "geo-maps: make esams default DC for France" [dns] - 10https://gerrit.wikimedia.org/r/1058026 (owner: 10Fabfur) [06:59:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:59:24] (03CR) 10Fabfur: [C:03+2] Revert "geo-maps: make esams default DC for France" [dns] - 10https://gerrit.wikimedia.org/r/1058026 (owner: 10Fabfur) [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T0700). 
nyaa~ [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:01:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67026 and previous config saved to /var/cache/conftool/dbconfig/20240730-070114-root.json [07:02:46] (03PS1) 10Marostegui: db1238: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058027 [07:03:22] (03CR) 10Marostegui: [C:03+2] db1238: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058027 (owner: 10Marostegui) [07:10:14] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy2002.codfw.wmnet with OS bullseye [07:14:19] !log finish rolling out benthos 4.27.0-1 [07:14:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67027 and previous config saved to /var/cache/conftool/dbconfig/20240730-071454-root.json [07:14:59] (03PS1) 10Marostegui: db1244: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058029 [07:15:37] (03CR) 10Marostegui: [C:03+2] db1244: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058029 (owner: 10Marostegui) [07:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling 
@ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67028 and previous config saved to /var/cache/conftool/dbconfig/20240730-071619-root.json [07:19:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:32] (03PS2) 10Alexandros Kosiaris: site.pp: Add hostnames for the new mw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1053299 (https://phabricator.wikimedia.org/T368933) [07:22:26] (03CR) 10Alexandros Kosiaris: [C:03+2] site.pp: Add hostnames for the new mw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1053299 (https://phabricator.wikimedia.org/T368933) (owner: 10Alexandros Kosiaris) [07:23:13] (03CR) 10Filippo Giunchedi: [C:03+2] burrow: restart on failure [puppet] - 10https://gerrit.wikimedia.org/r/1057886 (https://phabricator.wikimedia.org/T366573) (owner: 10Filippo Giunchedi) [07:28:21] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy2002.codfw.wmnet with reason: host reimage [07:30:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67029 and previous config saved to /var/cache/conftool/dbconfig/20240730-072959-root.json [07:30:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67030 and previous config saved to /var/cache/conftool/dbconfig/20240730-073124-root.json [07:33:21] !log 
akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy2002.codfw.wmnet with reason: host reimage [07:34:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:24] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: fix dell_config_changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1057927 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [07:42:22] (03PS1) 10Kevin Bazira: ml-services: move logo-detection isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058031 (https://phabricator.wikimedia.org/T370757) [07:44:41] (03PS19) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [07:44:55] (03CR) 10Elukey: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [07:45:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67031 and previous config saved to /var/cache/conftool/dbconfig/20240730-074505-root.json [07:46:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67032 and previous config saved to /var/cache/conftool/dbconfig/20240730-074629-root.json [07:47:17] (03PS2) 10Jelto: gerrit: add nft throttling on replica but don't enable yet [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:47:18] (03CR) 10Jelto: [V:03+1 C:03+1] "this looks good to me now, absent on gerrit1003 and present on gitlab2002" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:49:10] 06SRE, 10conftool, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252#10026379 (10elukey) 05Open→03Resolved a:03elukey [07:53:24] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058031 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [07:53:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026382 (10Volans) Has it been cleared from puppet5 between each reimage? After the first reimage if the host is in puppetdb the reimage cookbook will use the current pu... [07:54:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026384 (10Marostegui) @Volans after the first reimage that failed, I did clean all the certificates from puppet, but looks like Papaul found the same issues. [08:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67033 and previous config saved to /var/cache/conftool/dbconfig/20240730-080010-root.json [08:01:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67034 and previous config saved to /var/cache/conftool/dbconfig/20240730-080135-root.json [08:01:53] (03PS1) 10Slyngshede: P:openldap::management: Add ops-limited to cross validation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1058081 (https://phabricator.wikimedia.org/T360356) [08:02:21] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [08:03:20] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [08:03:49] (03PS1) 10Marostegui: mariadb: Move db1224 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1058082 (https://phabricator.wikimedia.org/T371276) [08:04:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1224.eqiad.wmnet with reason: Move db1224 to x1 [08:05:19] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [08:05:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1224.eqiad.wmnet with reason: Move db1224 to x1 [08:05:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179 T371276', diff saved to https://phabricator.wikimedia.org/P67035 and previous config saved to /var/cache/conftool/dbconfig/20240730-080538-root.json [08:05:49] T371276: Add one more replica to x1 - https://phabricator.wikimedia.org/T371276 [08:05:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1179.eqiad.wmnet with reason: Move db1224 to x1 [08:06:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1179.eqiad.wmnet with reason: Move db1224 to x1 [08:06:17] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1179.eqiad.wmnet onto 
db1224.eqiad.wmnet [08:06:41] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058031 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [08:06:49] !log Update db1224 on zarcillo T371276 [08:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:34] (03Merged) 10jenkins-bot: ml-services: move logo-detection isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058031 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [08:09:10] (03PS1) 10Marostegui: db1179: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058083 [08:09:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:39] (03CR) 10Marostegui: [C:03+2] db1179: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058083 (owner: 10Marostegui) [08:10:01] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1224 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1058082 (https://phabricator.wikimedia.org/T371276) (owner: 10Marostegui) [08:11:12] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [08:11:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy2002.codfw.wmnet with OS bullseye [08:15:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67037 and previous config saved to /var/cache/conftool/dbconfig/20240730-081515-root.json [08:15:40] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:08] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10026442 (10MatthewVernon) Looking today, they are set as follows: Americas 1: 7am - 3pm America/Chicago [12:00 - 20:00 UTC I think?] Americas 2: 11am - 7pm America... [08:19:34] (03PS1) 10Hashar: (DO NOT SUBMIT) test logstash-filter-verifier [puppet] - 10https://gerrit.wikimedia.org/r/1058085 (https://phabricator.wikimedia.org/T371285) [08:20:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:04] FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:22:28] (03Abandoned) 10Hashar: (DO NOT SUBMIT) test logstash-filter-verifier [puppet] - 10https://gerrit.wikimedia.org/r/1058085 (https://phabricator.wikimedia.org/T371285) (owner: 10Hashar) [08:24:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T371345 [08:24:19] T371345: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T371345 [08:24:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T371345 [08:29:37] (03CR) 10Vgutierrez: [C:03+1] "those should be removed in a following commit after this one gets merged and puppet cleans up benthos in ulsfo. Sadly `profile::benthos` d" [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [08:30:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67038 and previous config saved to /var/cache/conftool/dbconfig/20240730-083020-root.json [08:32:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2216.codfw.wmnet onto db2212.codfw.wmnet [08:40:22] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10026510 (10MatthewVernon) @NBaca-WMF thanks for the update :) If you could let me know when you've got planned timescales for this, that'd be helpful, pl... [08:43:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026516 (10Volans) The reimage cookbook did set it to puppet 7 each time: ` $ grep "node=db2227.codfw.wmnet, command='printf" reimage-extended.log 2024-07-29 12:27:45,34... [08:44:06] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026518 (10Marostegui) @volans you can reimage anytime. 
[08:45:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67039 and previous config saved to /var/cache/conftool/dbconfig/20240730-084525-root.json [08:46:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [08:46:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [08:54:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:11] (03PS1) 10Hashar: (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 [08:58:42] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 (owner: 10Hashar) [08:59:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:59] (03CR) 10Fabfur: "re following commit: ack" [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [09:02:12] (03PS2) 10Hashar: (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 [09:04:44] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 (owner: 10Hashar) [09:05:29] (03PS3) 10Hashar: (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 [09:07:55] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) logstash 
failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 (owner: 10Hashar) [09:08:13] (03PS1) 10MVernon: site.pp: new thanos backends are safe to add to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/1058092 (https://phabricator.wikimedia.org/T370453) [09:08:41] (03CR) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:09:47] (03PS2) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [09:09:47] (03PS3) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) [09:10:39] (03CR) 10Filippo Giunchedi: [C:03+1] site.pp: new thanos backends are safe to add to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/1058092 (https://phabricator.wikimedia.org/T370453) (owner: 10MVernon) [09:10:39] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:40] !log Starting s1 codfw failover from db2203 to db2212 - T371345 [09:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:45] T371345: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T371345 [09:12:19] (03CR) 10MVernon: [C:03+2] site.pp: new thanos backends are safe to add to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/1058092 (https://phabricator.wikimedia.org/T370453) (owner: 10MVernon) [09:14:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:43] (03PS4) 10Hashar: (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 [09:15:03] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10026645 (10Fabfur) [09:16:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10026651 (10MatthewVernon) a:05MatthewVernon→03None [09:17:07] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 (owner: 10Hashar) [09:17:42] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10026648 (10MatthewVernon) @RobH done, thanks. 
[09:17:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2212 to s1 primary T371345', diff saved to https://phabricator.wikimedia.org/P67040 and previous config saved to /var/cache/conftool/dbconfig/20240730-091742-root.json [09:17:52] T371345: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T371345 [09:18:44] (03CR) 10Klausman: [C:03+1] ml-services: move logo-detection isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058031 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [09:19:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2203 T371345', diff saved to https://phabricator.wikimedia.org/P67041 and previous config saved to /var/cache/conftool/dbconfig/20240730-091925-marostegui.json [09:20:37] (03PS3) 10Hnowlan: mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) [09:20:46] (03PS7) 10Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [09:21:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2203.codfw.wmnet with reason: Long schema change [09:21:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2203.codfw.wmnet with reason: Long schema change [09:25:39] (03PS1) 10Marostegui: db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058098 [09:25:53] (03PS1) 10Klausman: hiera/deployment-server: create logo-detection config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1058097 [09:26:10] (03CR) 10Marostegui: [C:03+2] db2203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058098 (owner: 10Marostegui) [09:26:40] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 
netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [09:26:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [09:29:57] !log Deploy schema change on db2203 s1 codfw dbmaint T367856 [09:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:02] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:30:47] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3447/co" [puppet] - 10https://gerrit.wikimedia.org/r/1058097 (owner: 10Klausman) [09:31:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67042 and previous config saved to /var/cache/conftool/dbconfig/20240730-093138-root.json [09:33:15] (03CR) 10Elukey: [C:03+2] admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:38:46] (03CR) 10Ilias Sarantopoulos: [C:03+1] hiera/deployment-server: create logo-detection config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1058097 (owner: 10Klausman) [09:39:21] (03CR) 10Klausman: [V:03+1 C:03+2] hiera/deployment-server: create logo-detection config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1058097 (owner: 10Klausman) [09:40:16] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#10026702 (10elukey) Finally the change is being rolled out! ` elukey@an-worker1080:~$ id wpao uid=21258(wpao) gid=500(wikidev) groups=500(wikidev),4(adm),724(ops-limited... 
[09:42:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1179.eqiad.wmnet onto db1224.eqiad.wmnet [09:42:36] (03PS1) 10Marostegui: db1224: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058100 [09:42:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67043 and previous config saved to /var/cache/conftool/dbconfig/20240730-094256-root.json [09:44:47] (03CR) 10Marostegui: [C:03+2] db1224: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058100 (owner: 10Marostegui) [09:45:01] (03PS1) 10Elukey: admin: deprecate sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/1058101 (https://phabricator.wikimedia.org/T360356) [09:45:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67044 and previous config saved to /var/cache/conftool/dbconfig/20240730-094549-root.json [09:46:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67045 and previous config saved to /var/cache/conftool/dbconfig/20240730-094643-root.json [09:49:09] (03PS1) 10Marostegui: db1179: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058102 [09:50:50] (03CR) 10Marostegui: [C:03+2] db1179: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1058102 (owner: 10Marostegui) [09:51:35] (03CR) 10Clément Goubert: Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:51:49] RESOLVED: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:58:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67046 and previous config saved to /var/cache/conftool/dbconfig/20240730-095802-root.json [09:59:16] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10026752 (10hnowlan) >>! In T370783#10024896, @jhathaway wrote: >>>! In T370783#10023283, @hnowlan wrote: >> Are we classifying "incident issue closed" as resolved? > > Resolved maps well to our do... [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1000) [10:00:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67047 and previous config saved to /var/cache/conftool/dbconfig/20240730-100055-root.json [10:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67048 and previous config saved to /var/cache/conftool/dbconfig/20240730-100148-root.json [10:02:05] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [10:02:23] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: add nft throttling on replica but don't enable yet [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [10:03:53] (03PS1) 10Filippo Giunchedi: grafana: set timeinterval 60s for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) [10:04:05] (03PS1) 10Klausman: admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 [10:06:15] 06SRE, 06serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW 
appservers - https://phabricator.wikimedia.org/T237889#10026798 (10jijiki) [10:06:37] 06SRE, 06Data-Persistence, 06serviceops, 07Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#10026813 (10jijiki) [10:08:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [10:09:12] (03PS1) 10Clément Goubert: scap: Remove legacy appserver clusters [puppet] - 10https://gerrit.wikimedia.org/r/1058099 (https://phabricator.wikimedia.org/T367949) [10:09:18] (03CR) 10Elukey: [C:03+1] admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 (owner: 10Klausman) [10:11:32] (03PS2) 10Klausman: admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 [10:12:42] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 (owner: 10Klausman) [10:13:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67049 and previous config saved to /var/cache/conftool/dbconfig/20240730-101307-root.json [10:13:14] (03CR) 10Klausman: [C:03+2] admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 (owner: 10Klausman) [10:14:42] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [10:14:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin2002 for host db2227.codfw.wmnet with OS bookworm [10:15:25] RESOLVED: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on 
mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67050 and previous config saved to /var/cache/conftool/dbconfig/20240730-101600-root.json [10:16:37] (03Merged) 10jenkins-bot: admin_ng/LiftWing: add logo-detection namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058105 (owner: 10Klausman) [10:16:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67051 and previous config saved to /var/cache/conftool/dbconfig/20240730-101654-root.json [10:20:10] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:20:11] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:20:20] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:20:21] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:20:28] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:20:29] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:20:41] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:21:35] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:28:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67052 and previous config saved to /var/cache/conftool/dbconfig/20240730-102813-root.json [10:29:09] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2227.codfw.wmnet with reason: host reimage [10:31:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67053 and previous config saved to /var/cache/conftool/dbconfig/20240730-103106-root.json [10:32:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67054 and previous config saved to /var/cache/conftool/dbconfig/20240730-103200-root.json [10:32:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026869 (10Volans) I've just cleaned the host from both puppetmaster and puppetserver to be sure it was not there and run: ` sudo cookbook sre.hosts.reimage -t T369654 -... [10:32:48] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: host reimage [10:33:06] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
[10:35:03] jnuche: it looks like the train hasn't correctly distributed the new version to all servers [10:35:35] (03PS1) 10Slyngshede: Permission approval/rejection [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 [10:36:22] (03PS1) 10Fabfur: admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) [10:36:56] claime: yeah, the presync failed last night, I'm trying to fix an issue with scap and then I'll rerun the presync again [10:37:00] ack [10:37:16] (03CR) 10CI reject: [V:04-1] admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [10:40:26] (03PS2) 10Fabfur: admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) [10:41:21] (03CR) 10CI reject: [V:04-1] admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [10:42:51] (03CR) 10Kamila Součková: [C:03+1] mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:42:54] (03CR) 10Stevemunene: [C:03+2] dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:42:56] (03PS3) 10Fabfur: admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) [10:43:06] (03PS4) 10Stevemunene: dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) [10:43:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67056 and previous config saved to 
/var/cache/conftool/dbconfig/20240730-104318-root.json [10:43:50] (03CR) 10CI reject: [V:04-1] admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [10:44:04] (03CR) 10Stevemunene: dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:44:28] (03CR) 10Stevemunene: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:45:48] (03CR) 10Stevemunene: [C:03+2] dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67057 and previous config saved to /var/cache/conftool/dbconfig/20240730-104612-root.json [10:46:37] (03PS4) 10Fabfur: admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) [10:47:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67058 and previous config saved to /var/cache/conftool/dbconfig/20240730-104705-root.json [10:49:09] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - volans@cumin2002" [10:50:24] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - volans@cumin2002" [10:50:25] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2227.codfw.wmnet with OS bookworm [10:50:35] 10ops-codfw, 06SRE, 
06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin2002 for host db2227.codfw.wmnet with OS bookworm completed: - db2227 (**PASS*... [10:51:36] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:53:01] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:54:21] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:23] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:54:24] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:55:00] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:55:14] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:55:24] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:55:25] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:55:31] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:55:55] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:56:58] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:58:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67059 and previous config saved to /var/cache/conftool/dbconfig/20240730-105825-root.json [10:58:45] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[10:59:07] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:00:01] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:00:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10026919 (10Fabfur) [11:00:40] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:00:58] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67060 and previous config saved to /var/cache/conftool/dbconfig/20240730-110117-root.json [11:01:20] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:01:23] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [11:01:47] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:01:57] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:02:21] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:02:32] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:02:44] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:02:51] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:02:58] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:03:06] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:03:58] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[11:05:25] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1058113 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [11:05:56] ^^ thx slyngs [11:10:10] (03PS4) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [11:10:13] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [11:13:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10026928 (10Marostegui) >>! In T369654#10026869, @Volans wrote: > I've just cleaned the host from both puppetmaster and puppetserver to be sure it was not there and run:... [11:13:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67061 and previous config saved to /var/cache/conftool/dbconfig/20240730-111331-root.json [11:16:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67062 and previous config saved to /var/cache/conftool/dbconfig/20240730-111622-root.json [11:18:16] (03PS1) 10Urbanecm: refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058116 (https://phabricator.wikimedia.org/T371316) [11:18:27] (03PS1) 10Urbanecm: refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058117 (https://phabricator.wikimedia.org/T371316) [11:22:04] jouncebot: nowandnext [11:22:04] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [11:22:04] In 0 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1200) [11:22:22] (03CR) 10Urbanecm: [C:03+2] refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058117 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:22:26] (03CR) 10Urbanecm: [C:03+2] refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058116 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:25:06] (03PS1) 10Clément Goubert: SRELBBatchRunnerBase: Add depool_status attribute [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 [11:25:56] (03PS1) 10Clément Goubert: sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 [11:32:03] (03PS1) 10Hnowlan: thumbor: filter duplicate error messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058123 (https://phabricator.wikimedia.org/T368180) [11:32:05] (03PS2) 10Cathal Mooney: lvs2013: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) [11:33:30] (03CR) 10Alexandros Kosiaris: [C:03+1] sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 (owner: 10Clément Goubert) [11:39:32] (03CR) 10Volans: [C:03+1] "LGTM, module the unrelated changes" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 (owner: 10Clément Goubert) [11:40:19] (03CR) 10Clément Goubert: [C:03+1] thumbor: filter duplicate error messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058123 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [11:40:43] (03CR) 10Volans: sre.k8s.reboot-nodes: Set pooled=inactive (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 (owner: 10Clément Goubert) 
[11:41:19] (03CR) 10Hnowlan: [C:03+2] thumbor: filter duplicate error messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058123 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [11:42:16] (03Merged) 10jenkins-bot: thumbor: filter duplicate error messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058123 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [11:47:47] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:47:54] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:49:22] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:00] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:53:11] (03PS3) 10Urbanecm: [Growth] hywwiki: Disable Add link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057970 (https://phabricator.wikimedia.org/T370558) [11:53:14] (03CR) 10Urbanecm: [C:03+2] [Growth] hywwiki: Disable Add link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057970 (https://phabricator.wikimedia.org/T370558) (owner: 10Urbanecm) [11:53:32] jouncebot: nowandnext [11:53:32] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [11:53:32] In 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1200) [11:53:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057970 (https://phabricator.wikimedia.org/T370558) (owner: 10Urbanecm) [11:53:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1058117 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:53:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058116 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:53:54] (03Merged) 10jenkins-bot: [Growth] hywwiki: Disable Add link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057970 (https://phabricator.wikimedia.org/T370558) (owner: 10Urbanecm) [11:54:53] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:58:16] a TLS error?! [11:58:20] 11:58:01 backport failed: Could not find a suitable TLS CA certificate bundle, invalid path: /var/lib/scap/scap/lib/python3.9/site-packages/certifi/cacert.pem [11:58:43] that path appears to be valid... [11:59:00] (03PS1) 10Marostegui: Revert "db1231: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1058137 [11:59:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058117 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:59:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058116 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [11:59:02] trying again [11:59:05] urbanecm: really sorry, I missed your backport, I was trying something out with scap [11:59:11] jnuche: ah, so it was you! 
:) [11:59:12] makes sense [11:59:26] please try again [11:59:31] yep, now it appears to work [11:59:33] I'll wait until you're finished [11:59:46] thanks, i'll ping you when done [11:59:48] (03CR) 10Marostegui: [C:03+2] Revert "db1231: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1058137 (owner: 10Marostegui) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1200) [12:00:07] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10027115 (10Ladsgroup) Yeah. If we can build a public API from zarcillo, it'd would make the whole easier. [12:01:03] (03Merged) 10jenkins-bot: refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058117 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [12:01:50] (03Merged) 10jenkins-bot: refreshLinkRecommendations: Work even when link-recommendation is disabled [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058116 (https://phabricator.wikimedia.org/T371316) (owner: 10Urbanecm) [12:01:53] finally [12:02:33] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057970|[Growth] hywwiki: Disable Add link backend (T370558)]], [[gerrit:1058117|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]], [[gerrit:1058116|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]] [12:02:39] (03CR) 10Alexandros Kosiaris: [C:03+1] scap: Remove legacy appserver clusters [puppet] - 10https://gerrit.wikimedia.org/r/1058099 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:02:39] T370558: Disable Add Link backend on hywwiki - https://phabricator.wikimedia.org/T370558 [12:02:39] T371316: 
refreshLinkRecommendations.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T371316 [12:04:02] (03PS2) 10Clément Goubert: SRELBBatchRunnerBase: Add depool_status attribute [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 [12:04:23] (03CR) 10Clément Goubert: SRELBBatchRunnerBase: Add depool_status attribute (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 (owner: 10Clément Goubert) [12:04:35] (03PS2) 10Clément Goubert: sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 [12:05:53] (03PS1) 10Marostegui: db1201: Make it s6 candidate [puppet] - 10https://gerrit.wikimedia.org/r/1058139 (https://phabricator.wikimedia.org/T371361) [12:07:21] (03CR) 10Marostegui: [C:03+2] db1201: Make it s6 candidate [puppet] - 10https://gerrit.wikimedia.org/r/1058139 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [12:08:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T371361', diff saved to https://phabricator.wikimedia.org/P67064 and previous config saved to /var/cache/conftool/dbconfig/20240730-120805-root.json [12:08:11] T371361: A6 and D3 have 3 db masters each - https://phabricator.wikimedia.org/T371361 [12:08:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1201.eqiad.wmnet with reason: Change binlog format [12:08:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1201.eqiad.wmnet with reason: Change binlog format [12:11:04] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 (owner: 10Clément Goubert) [12:11:06] (03PS1) 10Marostegui: db1231: Remove it from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058140 (https://phabricator.wikimedia.org/T371361) [12:11:38] (03CR) 10Fabfur: [C:03+2] admin: added user Michael Grosse [puppet] - 10https://gerrit.wikimedia.org/r/1058113 
(https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [12:12:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67065 and previous config saved to /var/cache/conftool/dbconfig/20240730-121201-root.json [12:12:07] (03CR) 10Marostegui: [C:03+2] db1231: Remove it from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058140 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [12:13:45] (03PS3) 10Clément Goubert: sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 [12:14:00] (03CR) 10Clément Goubert: sre.k8s.reboot-nodes: Set pooled=inactive (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 (owner: 10Clément Goubert) [12:15:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231 T371361', diff saved to https://phabricator.wikimedia.org/P67066 and previous config saved to /var/cache/conftool/dbconfig/20240730-121500-root.json [12:15:07] T371361: A6 and D3 have 3 db masters each - https://phabricator.wikimedia.org/T371361 [12:15:48] stuck at 12:06:57 sync-masters: 50% (in-flight: 1; ok: 0; fail: 1; left: 0) / now [12:16:33] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1201 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1058141 (https://phabricator.wikimedia.org/T371365) [12:16:37] (03PS1) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1058142 (https://phabricator.wikimedia.org/T371365) [12:16:43] !log urbanecm@deploy1003 sync-world aborted: Backport for [[gerrit:1057970|[Growth] hywwiki: Disable Add link backend (T370558)]], [[gerrit:1058117|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]], [[gerrit:1058116|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]] (duration: 14m 10s) [12:16:45] (03CR) 10Brouberol: [C:03+1] trafficserver: add 
airflow-analytics-test discovery record (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [12:16:50] T370558: Disable Add Link backend on hywwiki - https://phabricator.wikimedia.org/T370558 [12:16:50] T371316: refreshLinkRecommendations.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T371316 [12:17:06] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057970|[Growth] hywwiki: Disable Add link backend (T370558)]], [[gerrit:1058117|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]], [[gerrit:1058116|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]] [12:17:53] (03PS1) 10Dreamy Jazz: Grant checkuser-temporary-account-no-preference to suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058143 (https://phabricator.wikimedia.org/T371364) [12:21:57] !log T371253 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=dewiktionary --logwiki=metawiki 'Gregorjohannes' 'Klegul' [12:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:02] T371253: Unblock stuck global rename of Klegul - https://phabricator.wikimedia.org/T371253 [12:22:13] (03PS1) 10Marostegui: db1193: Make it candidate master for s8 [puppet] - 10https://gerrit.wikimedia.org/r/1058144 (https://phabricator.wikimedia.org/T371361) [12:22:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1193 T371361', diff saved to https://phabricator.wikimedia.org/P67068 and previous config saved to /var/cache/conftool/dbconfig/20240730-122243-root.json [12:22:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1193.eqiad.wmnet with reason: Change binlog format [12:22:53] T371361: A6 and D3 have 3 db masters each - https://phabricator.wikimedia.org/T371361 [12:23:00] (03CR) 
10Marostegui: [C:03+2] db1193: Make it candidate master for s8 [puppet] - 10https://gerrit.wikimedia.org/r/1058144 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [12:23:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1193.eqiad.wmnet with reason: Change binlog format [12:23:55] (03Abandoned) 10Hashar: (DO NOT SUBMIT) logstash failure [puppet] - 10https://gerrit.wikimedia.org/r/1058088 (owner: 10Hashar) [12:24:33] (03PS1) 10Ayounsi: Netbox 4 regression: add symlink to custom WMF import dir [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058146 [12:24:38] (03PS1) 10Ayounsi: Netbox 4 regression: move WMF import to dedicated folder [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 [12:25:31] (03CR) 10Brouberol: trafficserver: add airflow-analytics-test discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [12:27:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67069 and previous config saved to /var/cache/conftool/dbconfig/20240730-122706-root.json [12:27:23] (03PS1) 10Marostegui: db1192: Remove from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058148 (https://phabricator.wikimedia.org/T371361) [12:27:40] (03CR) 10Hashar: "recheck after having restored the CI config https://gerrit.wikimedia.org/r/c/integration/config/+/1057881" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [12:27:44] (03CR) 10Hashar: "recheck after having restored the CI config https://gerrit.wikimedia.org/r/c/integration/config/+/1057881" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [12:28:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 
(re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67070 and previous config saved to /var/cache/conftool/dbconfig/20240730-122825-root.json [12:28:58] (03CR) 10Marostegui: [C:03+2] db1192: Remove from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058148 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [12:31:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1193 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1058149 (https://phabricator.wikimedia.org/T371368) [12:31:21] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1058150 (https://phabricator.wikimedia.org/T371368) [12:38:53] scap is having its days... [12:40:32] `mwdebug1001:/srv/mediawiki/php-1.43.0-wmf.16/index.php` indeed doesn't exist, `mwdebug1002:/srv/mediawiki/php-1.43.0-wmf.16/index.php` does [12:40:49] and deploy1003:/srv/mediawiki-staging/php-1.43.0-wmf.16/index.php does too [12:41:13] !log mwdebug1001: scap pull to overcome scap issues [12:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#10027374 (10elukey) The new ops-limited group is live, just sent an email to all SREs about it. 
[12:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67071 and previous config saved to /var/cache/conftool/dbconfig/20240730-124212-root.json [12:43:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67072 and previous config saved to /var/cache/conftool/dbconfig/20240730-124330-root.json [12:44:03] urbanecm: the train presync failed last night so it's not surprising if some files are missing and your backport is taking longer [12:44:23] but it should have still succeeded after a while, what messages were you seeing from the backport? [12:44:58] jnuche: only `Check 'check_testservers_baremetal' failed:` [12:45:19] and `mwdebug1001:/srv/mediawiki/php-1.43.0-wmf.16` was empty [12:46:17] mmmh, that's also what made the presync fail [12:46:28] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/1057951 (owner: 10Dzahn) [12:46:34] it's a recent change in scap, not sure how it actually works, going to investigate a bit [12:46:39] :( [12:46:45] so far, pulling to mwdebug appears to work [12:46:52] but not sure if it is a good idea to sync everywhere [12:46:54] jnuche: thoughts? 
[12:48:29] jouncebot: next [12:48:29] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1300) [12:49:00] (03PS1) 10Filippo Giunchedi: search-platform: triage alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1058156 (https://phabricator.wikimedia.org/T354255) [12:49:01] mmmh, if the problem is bare metal hosts missing code, it's possibly related to the problem with the scap installer that I'm trying to fix, because that problem is keeping right now at least one master server from syncing [12:49:25] and the bare metal targets get their code from the masters [12:49:51] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371100#10027387 (10phaultfinder) [12:50:19] otoh since we're not using bare for actual prod traffic, then this issue shouldn't really affect the actual deployment, right? [12:50:37] urbanecm: let me try to fix the master that's still failing, deploy1002, and maybe that will fix the problem [12:50:55] sounds good [12:51:02] ah, if we are not using bare metal for any prod traffic anymore, definitely then [12:51:07] ok, give me a min [12:52:13] (03PS1) 10Fabfur: admin: temporary removed ngkountas duplicate ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1058158 (https://phabricator.wikimedia.org/T371372) [12:52:20] (03PS1) 10Alexandros Kosiaris: Switch wikikube-worker1240-1304 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1058159 (https://phabricator.wikimedia.org/T368933) [12:52:51] (03PS2) 10Stevemunene: trafficserver: add airflow-analytics-test discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) [12:53:17] okay, scap pulling resolved the problem at mwdebug [12:53:19] waiting for jnuche [12:53:41] (03CR) 10Alexandros Kosiaris: [C:03+2] Switch wikikube-worker1240-1304 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1058159 
(https://phabricator.wikimedia.org/T368933) (owner: 10Alexandros Kosiaris) [12:54:07] urbanecm: we need to stop the backport before I can update scap in deploy1002 [12:54:17] jnuche: no problem, can do [12:54:24] thx [12:54:25] (03CR) 10Ssingh: [C:03+1] admin: temporary removed ngkountas duplicate ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1058158 (https://phabricator.wikimedia.org/T371372) (owner: 10Fabfur) [12:54:31] jnuche: scap terminated [12:54:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10027442 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bull... [12:54:42] (03PS1) 10Klausman: charts/knative-serving: add selector to activator netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058153 (https://phabricator.wikimedia.org/T365479) [12:55:00] (03CR) 10Stevemunene: trafficserver: add airflow-analytics-test discovery record (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [12:55:51] (03CR) 10Hashar: "I have trigerred the second post merge built which was to regenerate the doc after I have moved the CI image from Buster to Bullseye ( htt" [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) (owner: 10Cwhite) [12:55:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1240.eqiad.wmnet with OS bullseye [12:55:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10027449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bullseye... 
[12:56:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1240.eqiad.wmnet with OS bullseye [12:56:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10027451 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bull... [12:56:46] !log jnuche@deploy1003 Installing scap version "latest" for 2 hosts [12:56:56] !log jnuche@deploy1003 Installation of scap version "latest" completed for 2 hosts [12:57:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67073 and previous config saved to /var/cache/conftool/dbconfig/20240730-125717-root.json [12:57:30] 06SRE, 06Wikidata Integrations Team, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) - https://phabricator.wikimedia.org/T371078#10027447 (10Ladsgroup) 05Open→03Resolved Done: https://lists.wikimedia.org/postorius/lists/wikidata-for-wikimedia.lis... [12:57:58] ok, scap is fixed in deploy1002 [12:58:06] 06SRE, 06Infrastructure-Foundations, 10netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375 (10cmooney) 03NEW p:05Triage→03Low [12:58:11] urbanecm: can you try again? now syncing to that master should succeed [12:58:18] jnuche: let's try! 
[12:58:24] and hopefully that means all bare metal hosts get also synced correctly [12:58:29] let's see [12:58:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67074 and previous config saved to /var/cache/conftool/dbconfig/20240730-125836-root.json [12:58:42] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057970|[Growth] hywwiki: Disable Add link backend (T370558)]], [[gerrit:1058117|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]], [[gerrit:1058116|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]] [12:58:53] T370558: Disable Add Link backend on hywwiki - https://phabricator.wikimedia.org/T370558 [12:58:53] T371316: refreshLinkRecommendations.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T371316 [12:59:05] (03PS2) 10Filippo Giunchedi: search-platform: triage alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1058156 (https://phabricator.wikimedia.org/T354255) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1300). Please do the needful. [13:00:05] XXBlackburnXx and Gerges: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] Here [13:00:16] o/ im here if anyone needs me [13:00:31] please be patient, we were resolving an issue with scap [13:00:39] i'll start deployment shortly [13:00:55] 06SRE, 06collaboration-services: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#10027486 (10Jelto) Thanks, @Papaul, for opening the task! 
I just want to make sure I understand the issue correctly, so let me rephrase it: The GitLab service IPs [208.80.153.8/32](ht... [13:01:11] Is scap still down from yesterday? [13:03:07] it's having some issues, we're working to get it back to normal [13:03:28] (03CR) 10Ssingh: "Ready for review." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [13:03:34] jnuche: so far stuck at sync-masters :/ [13:03:42] > 13:01:37 sync-masters: 50% (in-flight: 1; ok: 1; fail: 0; left: 0) / [13:03:58] (03CR) 10Ssingh: "We are already running this version in prod on cp4052 but this is the package build." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [13:04:14] (03CR) 10Brouberol: [C:03+1] trafficserver: add airflow-analytics-test discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [13:04:39] (03PS1) 10Filippo Giunchedi: data-platform: deploy helmfile_admin_ng_pending_changes to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) [13:04:47] urbanecm: could be normal, deploy1002 is syncing the new wmf release from scratch now that scap is working there again [13:04:53] ack [13:05:12] o/ [13:05:15] (watching) [13:08:38] (03PS7) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387) [13:08:40] we're now at the checks again [13:08:41] (03CR) 10CDanis: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1058101 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:08:44] fingers crossed [13:09:05] (03CR) 10Alexandros Kosiaris: [C:03+2] Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:09:36] crossed here [13:09:49] (03PS7) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) [13:11:18] (03PS4) 10Pppery: Update nlwiki AbuseFilter config per consensus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055633 (https://phabricator.wikimedia.org/T370605) (owner: 10XXBlackburnXx) [13:11:53] (03CR) 10Brouberol: [C:03+1] "Thanks, and sorry about the mistake" [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:12:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67076 and previous config saved to /var/cache/conftool/dbconfig/20240730-131223-root.json [13:12:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage [13:12:57] !log Started MediaModeration script after it crashed - https://wikitech.wikimedia.org/wiki/MediaModeration [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:08] (03PS1) 10Alexandros Kosiaris: Switch all fixtures to mw-parsoid from parsoid-php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058165 (https://phabricator.wikimedia.org/T359387) [13:13:14] !log Started MediaModeration scan on ruwiki to catch-up on monthly limit [13:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:31] !log ruwiki scan is set to time out 
after 5 hours [13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67077 and previous config saved to /var/cache/conftool/dbconfig/20240730-131341-root.json [13:14:13] (03Abandoned) 10Alexandros Kosiaris: Switch all fixtures to mw-parsoid from parsoid-php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058165 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:14:55] (03CR) 10Filippo Giunchedi: "No problem! Thank you for the quick review. Once https://phabricator.wikimedia.org/T354762 is fixed we'll be routing the meta-alerts to th" [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:14:57] (03CR) 10Filippo Giunchedi: [C:03+2] data-platform: deploy helmfile_admin_ng_pending_changes to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:15:01] (03PS2) 10Filippo Giunchedi: data-platform: deploy helmfile_admin_ng_pending_changes to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) [13:15:23] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] data-platform: deploy helmfile_admin_ng_pending_changes to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/1058162 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:15:56] urbanecm: looks like scap didn't bail out yet, did it get past the bare metal checks? 
[13:16:05] yep, it's at sync-proxies [13:16:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage [13:16:07] so all looking good [13:16:14] \o/ [13:17:01] (03PS2) 10Ayounsi: Netbox 4 regression: add symlink to custom WMF import dir [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058146 [13:17:49] (03CR) 10Brouberol: "Sorry, it seems that the application isn't deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [13:19:37] XXBlackburnXx: hi, still around? :) [13:19:45] heey yes [13:19:55] (03CR) 10Urbanecm: [C:03+2] Update nlwiki AbuseFilter config per consensus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055633 (https://phabricator.wikimedia.org/T370605) (owner: 10XXBlackburnXx) [13:20:00] awesome. merging, will pull to debug soon [13:20:10] XXBlackburnXx: do you know how to use https://wikitech.wikimedia.org/wiki/WikimediaDebug? 
[13:20:28] yeah im aware [13:20:34] (03Merged) 10jenkins-bot: Update nlwiki AbuseFilter config per consensus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055633 (https://phabricator.wikimedia.org/T370605) (owner: 10XXBlackburnXx) [13:20:40] okay, good :) [13:21:13] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1057970|[Growth] hywwiki: Disable Add link backend (T370558)]], [[gerrit:1058117|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]], [[gerrit:1058116|refreshLinkRecommendations: Work even when link-recommendation is disabled (T371316)]] (duration: 22m 31s) [13:21:19] T370558: Disable Add Link backend on hywwiki - https://phabricator.wikimedia.org/T370558 [13:21:20] T371316: refreshLinkRecommendations.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T371316 [13:21:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10027569 (10Aklapper) [13:21:38] yay! [13:22:01] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1055633|Update nlwiki AbuseFilter config per consensus (T370605)]] [13:22:08] (03CR) 10Stevemunene: "Yes this is to be deployed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054342 and https://gerrit.wikimedia.org/r/c/operati" [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [13:22:09] T370605: Update nlwiki AbuseFilter config per consensus - https://phabricator.wikimedia.org/T370605 [13:22:36] seems scap is back to normal, neat :) [13:22:40] yep [13:22:47] thanks for the help jnuche [13:23:01] np! 
[13:24:11] (03CR) 10Alexandros Kosiaris: fixtures: Rename all parsoid-php references (0312 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:24:22] (03PS2) 10Alexandros Kosiaris: fixtures: Rename all parsoid-php references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) [13:24:23] (03PS2) 10Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) [13:25:45] !log urbanecm@deploy1003 xxblackburnxx, urbanecm: Backport for [[gerrit:1055633|Update nlwiki AbuseFilter config per consensus (T370605)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:25:53] XXBlackburnXx: please test via mwdebug [13:26:39] (03CR) 10DCausse: [C:03+2] search-platform: triage alert lint problems (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1058156 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:26:41] just looked at usergrouprights, seems all fine [13:26:50] sync it :) [13:26:54] !log urbanecm@deploy1003 xxblackburnxx, urbanecm: Continuing with sync [13:26:56] proceeding [13:27:49] (03Merged) 10jenkins-bot: search-platform: triage alert lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1058156 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:28:13] Gerges: still around? 
:) [13:28:20] Yes [13:28:37] (03CR) 10Alexandros Kosiaris: Remove parsoid-php certificates from mw deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:28:41] (03PS2) 10GergesShamon: [eswiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057018 (https://phabricator.wikimedia.org/T370158) [13:28:42] (03CR) 10JHathaway: [C:03+1] admin: deprecate sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/1058101 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:28:42] (03CR) 10Urbanecm: [C:03+2] [eswiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057018 (https://phabricator.wikimedia.org/T370158) (owner: 10GergesShamon) [13:28:46] (03CR) 10Alexandros Kosiaris: [C:03+2] fixtures: Rename all parsoid-php references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67078 and previous config saved to /var/cache/conftool/dbconfig/20240730-132846-root.json [13:28:57] (03PS5) 10GergesShamon: [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [13:29:02] (03PS6) 10GergesShamon: [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [13:29:04] (03CR) 10Urbanecm: [C:03+2] [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon) [13:29:17] 
(03CR) 10Alexandros Kosiaris: [C:03+2] Remove parsoid-php certificates from mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:29:25] (03Merged) 10jenkins-bot: [eswiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057018 (https://phabricator.wikimedia.org/T370158) (owner: 10GergesShamon) [13:29:36] (03PS1) 10Alexandros Kosiaris: Remove parsoid-async from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058166 (https://phabricator.wikimedia.org/T359387) [13:29:38] (03CR) 10Elukey: [C:03+2] admin: deprecate sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/1058101 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:29:45] (03CR) 10Urbanecm: [C:03+2] "Since Trizek (CRS on Editing) filled the request, it has sponsorship from Editing. Proceeding." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) (owner: 10GergesShamon) [13:29:56] (03Merged) 10jenkins-bot: [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon) [13:30:24] (03Merged) 10jenkins-bot: Enable VisualEditor at Spanish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) (owner: 10GergesShamon) [13:30:42] !log deprecate the sre-admins posix group fleetwide (replaced by ops-limited) - T360356 [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:03] T360356: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356 [13:31:36] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1055633|Update nlwiki AbuseFilter config per consensus (T370605)]] (duration: 09m 35s) [13:31:45] (03CR) 
10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) (owner: 10GergesShamon) [13:31:48] XXBlackburnXx: should be live [13:31:49] T370605: Update nlwiki AbuseFilter config per consensus - https://phabricator.wikimedia.org/T370605 [13:32:00] Gerges: working on your patches now [13:32:01] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057018|[eswiki] Enable Visual Editor in namespace Project (T370158)]], [[gerrit:1052372|[euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu (T368632)]], [[gerrit:1057196|Enable VisualEditor at Spanish Wikiquote (T355336)]] [13:32:09] T370158: Please activate VisualEditor at Wikipedia namespace in eswiki - https://phabricator.wikimedia.org/T370158 [13:32:09] T368632: Enable VisualEditor at namespaces Wikipedia: and Wikiproiektu: at euwiki - https://phabricator.wikimedia.org/T368632 [13:32:09] T355336: Enable the visual editor at Spanish Wikiquote - https://phabricator.wikimedia.org/T355336 [13:32:10] perfect, thanks [13:32:48] (03Merged) 10jenkins-bot: fixtures: Rename all parsoid-php references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018661 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:32:52] (03Merged) 10jenkins-bot: Remove parsoid-php certificates from mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:33:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:34:22] !log urbanecm@deploy1003 urbanecm, gergesshamon: Backport for [[gerrit:1057018|[eswiki] Enable Visual Editor in namespace Project (T370158)]], [[gerrit:1052372|[euwiki] Enable Visual Editor in namespaces Project and 
Wikiproiektu (T368632)]], [[gerrit:1057196|Enable VisualEditor at Spanish Wikiquote (T355336)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:28] Gerges: can you test your patches at mwdebug, please? [13:34:40] Ok [13:36:27] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove parsoid-php certificates from mw deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018660 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:36:40] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove parsoid-async from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058166 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:37:16] (03Merged) 10jenkins-bot: Remove parsoid-async from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058166 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:37:20] (03CR) 10Alexandros Kosiaris: [C:03+2] services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [13:38:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10027675 (10Papaul) @Volans thank you. Do we have a long term fix for this because we are having this issue often when we are running the re-image, some nodes are sending... [13:38:58] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#10027676 (10elukey) Everything seems done, I'll leave the task open to wait for questions/feedback/etc.. 
[13:39:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:39:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1240.eqiad.wmnet with OS bullseye [13:39:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10027682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bullseye... [13:39:42] Gerges: how is it looking? [13:39:48] did you test? [13:39:49] (03CR) 10Elukey: "Hey folks! Any feedback? :)" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [13:39:57] Testing... [13:41:17] (03CR) 10Elukey: [C:03+1] P:openldap::management: Add ops-limited to cross validation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1058081 (https://phabricator.wikimedia.org/T360356) (owner: 10Slyngshede) [13:41:30] (03CR) 10Clément Goubert: [C:03+2] scap: Remove legacy appserver clusters [puppet] - 10https://gerrit.wikimedia.org/r/1058099 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:43:01] All fine, except Spanish Wikiquote [13:43:18] Now all fine :) [13:43:28] sounds good [13:43:29] !log urbanecm@deploy1003 urbanecm, gergesshamon: Continuing with sync [13:43:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67079 and previous config saved to /var/cache/conftool/dbconfig/20240730-134352-root.json [13:46:42] (03PS1) 10Ebernhardson: Add NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058167 (https://phabricator.wikimedia.org/T355267) [13:48:14] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1057018|[eswiki] Enable Visual Editor in namespace Project (T370158)]], [[gerrit:1052372|[euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu (T368632)]], [[gerrit:1057196|Enable VisualEditor at Spanish Wikiquote (T355336)]] (duration: 16m 12s) [13:48:18] Gerges: deployed [13:48:19] anything else? 
[13:48:26] T370158: Please activate VisualEditor at Wikipedia namespace in eswiki - https://phabricator.wikimedia.org/T370158 [13:48:27] T368632: Enable VisualEditor at namespaces Wikipedia: and Wikiproiektu: at euwiki - https://phabricator.wikimedia.org/T368632 [13:48:27] T355336: Enable the visual editor at Spanish Wikiquote - https://phabricator.wikimedia.org/T355336 [13:48:29] Thanks [13:48:36] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10027700 (10elukey) Found another occurrence of unstaged content on puppetmaster1001: ` root@puppetmaster1001:/var/lib/git/op... [13:49:47] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:51:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10027713 (10Jhancock.wm) replaced the drive with a disk out of a decommed server. I'll leave the ticket open until you confirm all is well from your side. 
[13:54:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1242.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:46] (03PS1) 10Alexandros Kosiaris: restbase: Remove parsoid-php/parsoid-async from listeners [puppet] - 10https://gerrit.wikimedia.org/r/1058168 (https://phabricator.wikimedia.org/T357392) [13:54:48] (03PS1) 10Alexandros Kosiaris: parsoid-php: remove discovery, hosts, dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1058169 (https://phabricator.wikimedia.org/T359387) [13:54:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [13:55:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10027720 (10elukey) Unstaged all the content, I have also chowned gitpuppet:gitpuppet `.git/index` since it was owned by root... 
[13:55:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1241-9 - jclark@cumin1002" [13:55:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [13:56:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1241-9 - jclark@cumin1002" [13:56:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:49] (03CR) 10Alexandros Kosiaris: [C:03+2] restbase: Remove parsoid-php/parsoid-async from listeners [puppet] - 10https://gerrit.wikimedia.org/r/1058168 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [13:57:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [13:57:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [13:57:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [13:57:42] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [13:57:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [13:58:02] (03CR) 10Fabfur: [C:03+2] admin: temporary removed ngkountas duplicate ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1058158 (https://phabricator.wikimedia.org/T371372) (owner: 10Fabfur) [13:58:16] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - 
https://phabricator.wikimedia.org/T371260#10027726 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:58:18] !log Remove clouddb1021 from zarcillo database T368518 [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:25] T368518: decommission clouddb1021 - https://phabricator.wikimedia.org/T368518 [13:58:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10027733 (10Jhancock.wm) [13:58:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10027737 (10Jhancock.wm) 05Open→03Resolved [13:58:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [13:59:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10027754 (10Fabfur) @ngkountas please let me know when ready to submit another SSH key [14:00:34] (03PS22) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [14:00:40] (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:02:59] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:03:45] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10027781 (10jhathaway) >>! In T370783#10026752, @hnowlan wrote: >> How would "incident issue closed" differ? My apologies if these state definitions are listed somewhere. > > I was thinking in terms... 
[14:03:50] (03PS2) 10Alexandros Kosiaris: parsoid-php: remove discovery, hosts, dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1058169 (https://phabricator.wikimedia.org/T359387) [14:03:50] (03PS1) 10Alexandros Kosiaris: service: Switch parsoid-php to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1058171 (https://phabricator.wikimedia.org/T357392) [14:05:12] (03PS2) 10Ebernhardson: Add NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058167 (https://phabricator.wikimedia.org/T355267) [14:05:12] (03PS3) 10Ebernhardson: beta: Enable NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) [14:05:44] (03PS2) 10Alexandros Kosiaris: service: Switch parsoid-php to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1058171 (https://phabricator.wikimedia.org/T359387) [14:05:46] (03PS3) 10Alexandros Kosiaris: parsoid-php: remove discovery, hosts, dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1058169 (https://phabricator.wikimedia.org/T359387) [14:06:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1241-9 - jclark@cumin1002" [14:06:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1243.mgmt.eqiad.wmnet with reboot policy FORCED [14:06:48] (03CR) 10Alexandros Kosiaris: [C:03+2] service: Switch parsoid-php to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1058171 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [14:07:23] !log jclark@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1241-9 - jclark@cumin1002" [14:07:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=97) [14:07:39] 
(03CR) 10Clément Goubert: [C:03+1] Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:09:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1246.mgmt.eqiad.wmnet with reboot policy FORCED [14:09:26] (03CR) 10Elukey: "I like this one, but I'd add a README.md under customscripts (if it doesn't break anything) explaining what to up under scripts_imports an" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:10:16] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes-prod: add KubernetesContainerReachingMemoryLimit exception [alerts] - 10https://gerrit.wikimedia.org/r/1057192 (owner: 10Effie Mouzeli) [14:11:01] (03CR) 10Ayounsi: "Feel free to amend that commit, but I'd prefer to centralize the doc on wikitech rather than spread it around." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:11:09] (03CR) 10Effie Mouzeli: [C:03+2] mw-on-k8s: update latency expression [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [14:11:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1247.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:19] (03Merged) 10jenkins-bot: mw-on-k8s: update latency expression [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [14:12:42] (03CR) 10Clément Goubert: [C:03+2] SRELBBatchRunnerBase: Add depool_status attribute [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 (owner: 10Clément Goubert) [14:12:59] (03Merged) 10jenkins-bot: kubernetes-prod: add KubernetesContainerReachingMemoryLimit exception [alerts] - 10https://gerrit.wikimedia.org/r/1057192 (owner: 10Effie Mouzeli) [14:13:25] (03CR) 10Elukey: [C:03+1] Netbox 4 regression: add symlink to custom WMF import dir [software/netbox-deploy] - 
10https://gerrit.wikimedia.org/r/1058146 (owner: 10Ayounsi) [14:14:15] (03CR) 10Elukey: [C:03+1] "Assuming that it was all tested etc.. :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:14:57] FIRING: ProbeDown: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:10] Expected [14:15:26] !incidents [14:15:26] 4939 (UNACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw) [14:15:27] 4938 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet ulsfo) [14:15:31] !ack 4939 [14:15:31] 4939 (ACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw) [14:15:56] (03CR) 10Slyngshede: [C:03+2] P:openldap::management: Add ops-limited to cross validation. [puppet] - 10https://gerrit.wikimedia.org/r/1058081 (https://phabricator.wikimedia.org/T360356) (owner: 10Slyngshede) [14:15:59] thanks claime [14:16:04] thanks! [14:16:21] (03PS3) 10Ayounsi: Netbox 4 regression: add symlink to custom WMF import dir [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058146 [14:16:21] (03PS1) 10Ayounsi: Update requirements for netbox 4.0.8. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058173 [14:16:27] probes have already been removed btw, this was just a race, I think [14:16:33] (03Merged) 10jenkins-bot: SRELBBatchRunnerBase: Add depool_status attribute [cookbooks] - 10https://gerrit.wikimedia.org/r/1058119 (owner: 10Clément Goubert) [14:16:47] * akosiaris running puppet on alert hosts [14:17:03] (03PS2) 10Slyngshede: IDP: Switch to CAS 7.0 hosts. 
[dns] - 10https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487) [14:19:05] (03CR) 10Ayounsi: [C:03+2] Netbox 4 regression: move WMF import to dedicated folder [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:19:12] (03CR) 10Slyngshede: "I'm thinking maybe a quick check with @mvernon@wikimedia.org just to see if there's something obvious that we're missing." [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [14:19:24] jouncebot: nowandnext [14:19:24] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [14:19:24] In 0 hour(s) and 40 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1500) [14:19:57] RESOLVED: ProbeDown: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:06] (03CR) 10Hnowlan: [C:03+2] mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:20:23] (03CR) 10Ayounsi: [V:03+2 C:03+2] Netbox 4 regression: add symlink to custom WMF import dir [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058146 (owner: 10Ayounsi) [14:20:33] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389 (10RobH) 03NEW [14:20:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1245.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:42] (03CR) 10Bking: [C:03+1] charts/knative-serving: add selector to activator 
netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058153 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [14:20:49] (03PS1) 10DCausse: team-search: use deriv insteaf of rate for flink metrics [alerts] - 10https://gerrit.wikimedia.org/r/1058176 [14:20:54] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10027886 (10RobH) [14:20:54] (03CR) 10Ayounsi: [V:03+2 C:03+2] Update requirements for netbox 4.0.8. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1058173 (owner: 10Ayounsi) [14:20:55] !log jnuche@deploy1003 Installing scap version "latest" for 3 hosts [14:21:01] (03Merged) 10jenkins-bot: mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:21:38] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 (owner: 10Clément Goubert) [14:21:39] (03PS2) 10DCausse: team-search: use deriv instead of rate for flink metrics [alerts] - 10https://gerrit.wikimedia.org/r/1058176 [14:21:45] !log jnuche@deploy1003 Installing scap version "latest" for 2 hosts [14:21:54] !log jnuche@deploy1003 Installation of scap version "latest" completed for 2 hosts [14:22:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1242.mgmt.eqiad.wmnet with reboot policy FORCED [14:22:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1241.mgmt.eqiad.wmnet with reboot policy FORCED [14:22:59] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:23:39] (03CR) 10Klausman: [C:03+2] charts/knative-serving: add selector to activator netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058153 
(https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [14:23:50] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10027913 (10RobH) a:03MatthewVernon @matthewvernon, Please note there has been a slight change in the workflow for racking and installing hosts. The DC ops te... [14:24:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1249.mgmt.eqiad.wmnet with reboot policy FORCED [14:24:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [14:24:49] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#10027930 (10jhathaway) @Xover apologies for the radio silence on this issue, have you seen any new occurrences? [14:24:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [14:25:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [14:25:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [14:25:16] (03PS8) 10Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [14:25:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [14:25:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1243.mgmt.eqiad.wmnet with reboot policy FORCED [14:26:48] (03Merged) 10jenkins-bot: charts/knative-serving: add selector to activator netpolicy 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1058153 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [14:26:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:27:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1246.mgmt.eqiad.wmnet with reboot policy FORCED [14:29:26] (03CR) 10Hnowlan: [C:03+2] mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:30:18] (03Merged) 10jenkins-bot: mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:31:08] (03Merged) 10jenkins-bot: Netbox 4 regression: move WMF import to dedicated folder [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:31:09] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Set pooled=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/1058120 (owner: 10Clément Goubert) [14:33:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1247.mgmt.eqiad.wmnet with reboot policy FORCED [14:33:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host pc1017.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:35:22] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [14:35:23] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [14:35:28] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [14:36:15] 06SRE, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 
and IPV6 - https://phabricator.wikimedia.org/T370018#10028048 (10cmooney) Ultimately on these hosts those IPs are bridged onto the main interface, and both source and answer ARP requests on the wider (/27 and /64) networks they come... [14:36:37] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [14:36:38] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [14:36:53] 06SRE, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#10028050 (10cmooney) a:05Jelto→03cmooney [14:37:23] (03CR) 10Hnowlan: [C:03+2] shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:37:30] (03CR) 10CI reject: [V:04-1] shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:38:51] (03PS4) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [14:39:12] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 (10RobH) 03NEW [14:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:32] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - 
https://phabricator.wikimedia.org/T371400#10028092 (10RobH) [14:40:35] (03PS5) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [14:40:41] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10028109 (10RobH) a:03MatthewVernon @matthewvernon, Please note there has been a slight change in the workflow for racking and installing hosts. The DC ops tea... [14:42:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1017.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:42:58] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host pc2017.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:43:00] (03PS6) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [14:45:33] (03CR) 10BCornwall: [V:03+1 C:03+1] "Double-checked against netbox. Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056521 (https://phabricator.wikimedia.org/T370891) (owner: 10Cathal Mooney) [14:45:53] !log mwmaint1002: mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=enwiki --all --verbose (T370802; log kept at mwmaint1002:/home/urbanecm/revalidateLinkRecommendations-T370802-july-2024.log) [14:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:58] T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia - https://phabricator.wikimedia.org/T370802 [14:47:19] !log mforns@deploy1003 Started deploy [airflow-dags/analytics@e1fdaac]: (no justification provided) [14:47:24] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10028159 (10MatthewVernon) a:05MatthewVernon→03None Already done (via [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055254 | this change ]]), thanks. [14:47:35] !log mforns@deploy1003 Finished deploy [airflow-dags/analytics@e1fdaac]: (no justification provided) (duration: 00m 15s) [14:47:51] !log mforns@deploy1003 Started deploy [airflow-dags/analytics@e1fdaac]: (no justification provided) [14:48:18] !log mforns@deploy1003 Finished deploy [airflow-dags/analytics@e1fdaac]: (no justification provided) (duration: 00m 26s) [14:48:50] (03CR) 10Hnowlan: [C:03+2] shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:48:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10028163 (10MatthewVernon) a:05MatthewVernon→03None Already done via [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055254 | this change ]]... 
[14:50:39] (03Merged) 10jenkins-bot: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:51:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: upgrading anycast-hc: T370068] [14:51:10] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:51:20] (03CR) 10DCausse: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [14:51:34] !log [dns7001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068 [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:29] (03CR) 10Volans: "post-merge -1" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058147 (owner: 10Ayounsi) [14:56:57] (03PS1) 10Hnowlan: shellbox-video: set idle_upstream_timeout to 1 day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058186 (https://phabricator.wikimedia.org/T356241) [14:57:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10028206 (10MatthewVernon) Thanks, looking good from here :) [14:58:02] (03CR) 10Volans: [C:03+1] "cookbook wise looks ok, I'll leave the details of the warmup.py options to the experts :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [14:58:03] (03PS1) 10Ssingh: Remove parsoid-php discovery record [dns] - 10https://gerrit.wikimedia.org/r/1058189 (https://phabricator.wikimedia.org/T359387) [14:58:27] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [14:58:32] T336275: Upgrade Netbox to 4.x - 
https://phabricator.wikimedia.org/T336275 [14:59:00] (03CR) 10Kamila Součková: [C:03+1] shellbox-video: set idle_upstream_timeout to 1 day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058186 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:59:22] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2017.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:00:03] (03CR) 10Hnowlan: [C:03+2] shellbox-video: set idle_upstream_timeout to 1 day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058186 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1500). 
[15:00:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: finished upgrading anycast-hc: T370068] [15:00:30] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [15:00:56] (03Merged) 10jenkins-bot: shellbox-video: set idle_upstream_timeout to 1 day [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058186 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:01:53] (03CR) 10MVernon: [C:03+1] "Hi," [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [15:03:40] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [15:03:45] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [15:04:40] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [15:09:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275 [15:09:13] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [15:10:14] 06SRE, 10[DEPRECATED] wdwb-tech, 10Beta-Cluster-Infrastructure, 06serviceops, 10Wikidata: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#10028314 (10Krinkle) @Urbanecm_WMF Do you mind uploading it to Gerrit under 06SRE, 10Beta-Cluster-Infrastructure, 06serviceops, 10Wikidata, 10wmde-wikidata-tech: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#10028317 (10Krinkle) [15:15:24] (03CR) 10Slyngshede: "Well, I had hope you'd know what happened with the 
+x, but I won't complain :-)" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [15:15:42] (03PS1) 10Ayounsi: python_deploy_venv: update submodules URL in case it's needed [puppet] - 10https://gerrit.wikimedia.org/r/1058193 [15:15:52] (03PS4) 10Giuseppe Lavagetto: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - 10https://gerrit.wikimedia.org/r/1056937 [15:15:52] (03PS1) 10Giuseppe Lavagetto: scap::user: define the correct shell [puppet] - 10https://gerrit.wikimedia.org/r/1058194 [15:16:21] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove parsoid-php discovery record [dns] - 10https://gerrit.wikimedia.org/r/1058189 (https://phabricator.wikimedia.org/T359387) (owner: 10Ssingh) [15:16:40] (03PS2) 10Giuseppe Lavagetto: scap::user: define the correct shell [puppet] - 10https://gerrit.wikimedia.org/r/1058194 [15:17:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10028353 (10Marostegui) @ABran-WMF please coordinate with @cmooney for this. 
[15:18:49] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1058194 (owner: 10Giuseppe Lavagetto) [15:19:17] (03PS1) 10Joely Rooke WMDE: Fix tracking parameter casing [extensions/Wikibase] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) [15:20:45] (03CR) 10Giuseppe Lavagetto: [C:03+2] scap::user: define the correct shell [puppet] - 10https://gerrit.wikimedia.org/r/1058194 (owner: 10Giuseppe Lavagetto) [15:20:50] !log restart pybal for parsoid-php removal on lvs1020, lvs2014 T359387 [15:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:55] T359387: Cleanup parsoid-php service - https://phabricator.wikimedia.org/T359387 [15:22:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [15:25:40] (03PS2) 10Clément Goubert: Revert "httpbb: Add tests for new redirects" [puppet] - 10https://gerrit.wikimedia.org/r/1032145 [15:27:22] (03PS1) 10Gerrit Patch Uploader: [LOCAL HACK] Hack mw-cli-wrapper to work without conftool [puppet] - 10https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) [15:27:22] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [puppet] - 10https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: 10Gerrit Patch Uploader) [15:34:17] 06SRE, 10Beta-Cluster-Infrastructure, 06serviceops, 10Wikidata, 10wmde-wikidata-tech: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#10028409 (10Urbanecm_WMF) >>! In T125976#10028314, @Krinkle wrote: > @Urbanecm_WMF Do you mind uploading it to Gerrit under... 
[15:43:55] (03PS4) 10Elukey: Release version 0.5.0-1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) [15:43:57] (03CR) 10RLazarus: "Hi, Puppet request window SRE here. :) This is a little too complex to rubber-stamp in the Puppet window -- it should get some close revie" [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [15:44:20] (03CR) 10Elukey: Release version 0.5.0-1 (033 comments) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [15:45:09] (03CR) 10Elukey: "For some reason gerrit fails to apply all the suggested edits, worst case I'll amend manually, thanks!" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [15:46:20] jouncebot: nowandnext [15:46:20] For the next 0 hour(s) and 13 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1500) [15:46:20] In 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1600) [15:47:00] !log jnuche@deploy1003 Installing scap version "latest" for 2 hosts [15:47:07] (03PS5) 10Elukey: Release version 0.5.0-1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) [15:47:12] !log jnuche@deploy1003 Installation of scap version "latest" completed for 2 hosts [15:47:25] (03CR) 10Elukey: "Should be done!" 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [15:48:56] !log jnuche@deploy1003 Installing scap version "latest" for 214 hosts [15:49:22] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:33] !log jnuche@deploy1003 Installing scap version "latest" for 213 hosts [15:50:09] !log jnuche@deploy1003 Installation of scap version "latest" completed for 213 hosts [15:50:22] (03CR) 10Clément Goubert: [C:03+2] Revert "httpbb: Add tests for new redirects" [puppet] - 10https://gerrit.wikimedia.org/r/1032145 (owner: 10Clément Goubert) [15:51:25] (03CR) 10Ssingh: "Hi: Thanks for the patch! The CSP itself has been reviewed so we can skip that but for the varnish stuff, I think we should add some tests" [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [15:52:07] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10028462 (10Jhancock.wm) a:03Jhancock.wm [15:52:45] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10028469 (10Jhancock.wm) a:03Jhancock.wm [15:56:39] !log restart pybal for parsoid-php removal on lvs1019, lvs2013 T359387 [15:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:49] T359387: Cleanup parsoid-php service - https://phabricator.wikimedia.org/T359387 [16:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:01] (03PS1) 10Ayounsi: Replace is_private() with ip.is_ipv4_private_use() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058208 [16:01:34] (03PS2) 10Ayounsi: Replace is_private() with ip.is_ipv4_private_use() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058208 [16:04:09] (03CR) 10KOfori: [C:03+1] admin: add Kwaku as approver for the dns-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [16:06:46] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:06:59] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:07:21] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:07:23] (03PS1) 10Fabfur: admin: added migr user to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/1058209 (https://phabricator.wikimedia.org/T371010) [16:07:57] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:08:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:09:06] (03CR) 10Ssingh: [C:03+1] admin: added migr user to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/1058209 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [16:09:14] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:09:27] (03CR) 10Fabfur: [C:03+2] admin: added migr user to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/1058209 (https://phabricator.wikimedia.org/T371010) (owner: 10Fabfur) [16:10:10] (03PS3) 10Dzahn: lists: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) [16:10:40] (03CR) 10EoghanGaffney: [C:03+1] lists: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055492 
(https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:11:56] (03CR) 10Dzahn: [C:03+2] lists: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:13:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 2.026s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:18:12] (03CR) 10Elukey: [C:03+1] Replace is_private() with ip.is_ipv4_private_use() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1058208 (owner: 10Ayounsi) [16:18:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 6.229s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:24:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#10028582 (10jhathaway) >>! In T367399#9909402, @MoritzMuehlenhoff wrote: > Did one of thes... [16:33:39] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#10028606 (10jhathaway) p:05High→03Medium I briefly chatted with @hashar about this tas... 
[16:35:27] We have a wmf.16 blocker in T371376 which we're backporting a new parsoid to fix [16:35:28] T371376: Linter related error on PCS tests: Cannot use object of type stdClass as array - https://phabricator.wikimedia.org/T371376 [16:37:29] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.20.0-a16 [vendor] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058211 (https://phabricator.wikimedia.org/T371376) [16:38:57] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.20.0-a16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058212 (https://phabricator.wikimedia.org/T371126) [16:41:03] cscott: let me know if you'd need any help with deploying the backports [16:41:06] should i go ahead and C+2 these cherry-picks, or let an SRE do that?  I want to make sure code is re-synced before the train deploy. [16:41:16] the train hasn't rolled out to group0 yet, right? [16:41:33] cscott: it rolled out to testwiki (which is a "minus one" group) [16:41:46] so you need to either sync it, or have someone else do it for you [16:45:56] thanks ihurbain for poking me about my irc drop [16:46:16] cscott: did you get my messages? or should i resend? [16:46:38] resend please, sorry! [16:47:03] (03CR) 10Dzahn: "thanks for merging! I confirm I see the throttling chain and it's just "log accept"." [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [16:47:19] (03CR) 10C. Scott Ananian: [C:03+2] Bump wikimedia/parsoid to 0.20.0-a16 [vendor] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058211 (https://phabricator.wikimedia.org/T371376) (owner: 10C. Scott Ananian) [16:48:00] cscott: i was saying that wmf.16 rolled out to testwiki already, so "just merge" isn't sufficient at this point. you either need to deploy the code to production, or have someone else do that [16:48:08] i'm wondering whether you want to do that yourself, or whether i can be of any help. 
[16:49:32] well, i managed to confuse gerrit by cherry-picking the new patches before the original patches to master had merged. [16:50:00] since the core patch Depends-On the vendor patch, and the cherry-picked patches have the same Change-Id as the original, zuul seems to be a bit confused [16:50:20] so let me first baby sit zuul and make sure the patches merge to wmf.16 [16:50:46] once that's done, i could use help with the deploy yes.  either to Just Do It, or to hold my hand since I haven't done it in a long while. [16:52:10] feel free to ping me :) [16:59:04] (03CR) 10Ottomata: [C:03+2] mediawiki.org - Rewrite /beacon/event -> EventLogging rest handler [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1700) [17:05:39] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.20.0-a16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058212 (https://phabricator.wikimedia.org/T371126) (owner: 10C. Scott Ananian) [17:07:43] (03PS1) 10Hashar: wm-pcc: separate v5 and v7 in two runs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1058219 (https://phabricator.wikimedia.org/T371407) [17:10:05] (03CR) 10C. Scott Ananian: [C:03+2] "recheck" [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058212 (https://phabricator.wikimedia.org/T371126) (owner: 10C. 
Scott Ananian) [17:12:56] !log adding row C/D vlans to lsw1-b2-codfw and adding on trunk to lvs2012 T370862 [17:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:01] T370862: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862 [17:13:25] !log otto@deploy1003 Started scap sync-world: mediawiki.org - Apache Rewrite /beacon/event -> EventLogging rest handler - T353817 [17:13:30] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [17:15:23] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.20.0-a16 [vendor] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058211 (https://phabricator.wikimedia.org/T371376) (owner: 10C. Scott Ananian) [17:17:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [17:17:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [17:18:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028892 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22b0edee-c7a6-4b0f-9fea-2095ec62... 
[17:18:25] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [17:18:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [17:18:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028893 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6ff7dee3-4248-4c63-812a-befb7aa3... [17:18:56] !log otto@deploy1003 Finished scap: mediawiki.org - Apache Rewrite /beacon/event -> EventLogging rest handler - T353817 (duration: 05m 56s) [17:19:01] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [17:20:32] !log disable BGP to PyBal on lvs2012 from lsw1-b2-codfw (moving traffic to lvs2014) [17:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:42] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10028913 (10Fabfur) 05In progress→03Resolved User confirmed can access logs now [17:29:33] (03PS1) 10Volans: mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) [17:35:53] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.20.0-a16 [core] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058212 (https://phabricator.wikimedia.org/T371126) (owner: 10C. 
Scott Ananian) [17:37:59] (03CR) 10Hashar: [C:03+2] wm-pcc: separate v5 and v7 in two runs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1058219 (https://phabricator.wikimedia.org/T371407) (owner: 10Hashar) [17:38:29] (03Merged) 10jenkins-bot: wm-pcc: separate v5 and v7 in two runs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1058219 (https://phabricator.wikimedia.org/T371407) (owner: 10Hashar) [17:42:38] urbanecm: it looks like everything managed to merge and zuul is happy.  i'm going to grab a quick lunch and then maybe you can help me deploy the merged patches? [17:49:47] (03CR) 10Cathal Mooney: [C:03+2] lvs2012: move row C & D vlans to primary uplink and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056478 (https://phabricator.wikimedia.org/T370862) (owner: 10Cathal Mooney) [17:50:55] !log hashar@deploy1003 Started deploy [gerrit/gerrit@40e4e0f]: wm-pcc: separate v5 and v7 in two runs - T371407 [17:51:00] T371407: wmf-pcc: Puppet compiler integration is confusing when Puppet 5 support is dropped - https://phabricator.wikimedia.org/T371407 [17:51:05] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@40e4e0f]: wm-pcc: separate v5 and v7 in two runs - T371407 (duration: 00m 09s) [17:51:56] (03PS2) 10C. 
Scott Ananian: Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) [17:55:19] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [17:55:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [17:55:32] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [17:55:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10028999 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dd309020-6739-44e3-aae7-1db7e069... [17:55:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [17:55:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029001 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e014b03e-5922-4caa-80c4-c950cc41... 
[17:56:24] !log rebooting lvs2012 to force new network config T370862 [17:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:29] T370862: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862 [17:59:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian) [18:00:05] brennen and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T1800). [18:00:24] o/ [18:00:52] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#10029018 (10hashar) From T371407: The red chipset now has the v5 mention (v7 passes and i... [18:01:09] brennen: fwiw, cscott merged sth to wmf.16 some time ago [18:01:43] urbanecm: ack, thanks. [18:02:07] merged but not deployed, i gather. [18:02:26] correct [18:05:01] sorry, web.libera.chat keeps disconnecting me :( [18:05:12] wb cscott [18:05:44] !log Stopped MediaModeration scanning script on ruwiki [18:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:50] the parsoid backport is merged in git, but not yet deployed.  my understanding is it should be deployed to test wiki, and then we can proceed with the group0 deploy? [18:06:04] cscott: yeah, i can go ahead and do that. [18:06:59] ok, thanks!  i'll be on line if I can help at all. [18:07:33] cscott: the web thing will do that. You should probably look at an actual irc client. 
I think WMF used to somewhere have a team irccloud subscription. [18:07:52] Although free will keep you connected 2 hours [18:08:26] 10ops-eqiad, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416 (10RobH) 03NEW [18:08:47] 10ops-eqiad, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10029059 (10RobH) [18:09:04] cscott: RhinosF1 is right, you can request an irccloud.com subscription from WMF if you want [18:09:35] yeah, in the good old days I ran Pidgin and all my messages were in the same client.  I was using the matrix-IRC bridge for a while but it caused no end of headaches.  Also the fact that the official matrix client can't handle the fact that the IRC bridge and the Slack bridge required different logins was annoying. [18:09:47] urbanecm: I assume you probably know how he does that cause I can tell you nothing more than it exists [18:10:01] cscott: matrix was a mess [18:10:28] i'm holding out hope for a unified solution for matrix/slack/irc.  in theory 'fixing all our chats' is an APP goal this year, right? [18:10:39] 10ops-eqiad, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10029062 (10RobH) a:03Marostegui @Marostegui, Please note there has been a slight change in the workflow for racking and installing hosts. The DC ops team, a... [18:10:46] cscott: RhinosF1: https://office.wikimedia.org/wiki/ITS/IRCCloud [18:10:48] cscott: lol, sob. [18:10:51] Matrix should probably just not exist [18:10:56] s7 is unhappy, I know why [18:10:59] give me a bit [18:11:10] Amir1: k, holding on any deploys. [18:11:28] I'm getting an error page on en is that known? 
[18:11:49] i'm getting an error page on office, assuming same root cause [18:11:57] FIRING: [8x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:31] ... the particular page I was trying to load is now loading properly [18:12:33] (03PS1) 10Cathal Mooney: Remove vlan sub-interface for private1-b2-codfw on lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/1058230 [18:12:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P67082 and previous config saved to /var/cache/conftool/dbconfig/20240730-181242-ladsgroup.json [18:13:09] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [18:13:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [18:13:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:13:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029064 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a53d3f9e-80ae-429e-b814-01f035f8... 
[18:13:29] hmm [18:13:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool db1174', diff saved to https://phabricator.wikimedia.org/P67083 and previous config saved to /var/cache/conftool/dbconfig/20240730-181331-ladsgroup.json [18:13:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:13:55] (03CR) 10Cathal Mooney: [C:03+2] Remove vlan sub-interface for private1-b2-codfw on lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/1058230 (owner: 10Cathal Mooney) [18:13:55] it should be all over now [18:13:59] is it fixed? [18:14:22] FIRING: [10x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:31] errors / latency subsiding on mw-api-ext [18:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.002s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:15:39] RESOLVED: [10x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:16:19] things should recover now [18:16:54] Thanks for the quick response [18:16:57] RESOLVED: [10x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:05] Same issue or something else? [18:17:57] no, it was me. I deployed a schema change on a tiny table but replicas got lagged because every read in mediawiki on replicas does "transaction" to get consistent read and they were locking the table [18:18:12] so replication couldn't continue [18:18:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:18:31] aha. 
Well, thanks again for catching it [18:18:41] sorry for breaking things [18:18:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:19:18] I will never understand if there is an actual reason mediawiki locks rows on select on replicas [18:19:44] sounds like a Kr.inkle question maybe [18:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.002s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:20:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:21:34] good for deploys at this point, then? [18:21:53] brennen: yes please [18:21:55] sorry for this [18:22:02] no worries, thanks! 
[18:22:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:35] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1058211|Bump wikimedia/parsoid to 0.20.0-a16 (T371376 T371126)]] [18:22:43] (03PS1) 10Cathal Mooney: Add missing vlan names to balancer.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1058232 (https://phabricator.wikimedia.org/T370635) [18:23:13] T371376: Linter related error on PCS tests: Cannot use object of type stdClass as array - https://phabricator.wikimedia.org/T371376 [18:23:21] T371126: CTT tasks week of 2024-07-26 - https://phabricator.wikimedia.org/T371126 [18:24:18] (03CR) 10Cathal Mooney: [C:03+2] Add missing vlan names to balancer.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1058232 (https://phabricator.wikimedia.org/T370635) (owner: 10Cathal Mooney) [18:24:49] !log brennen@deploy1003 brennen, cscott: Backport for [[gerrit:1058211|Bump wikimedia/parsoid to 0.20.0-a16 (T371376 T371126)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:26:19] !log brennen@deploy1003 brennen, cscott: Continuing with sync [18:27:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:27:12] FIRING: [13x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:27:27] RESOLVED: [13x] ProbeDown: Service phab1004:443 has failed probes 
(http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:27:40] !log rebooting lvs2012 (again) to force new network config T370862 [18:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:59] T370862: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862 [18:28:55] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [18:29:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:15] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [18:29:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-b2-codfw.mgmt with reason: reconfigure vlans on lvs2012 [18:29:43] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [18:29:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs2012.codfw.wmnet with reason: reconfigure vlans on lvs2012 [18:30:12] (03PS3) 10Slyngshede: IDP: Switch to CAS 7.0 hosts. 
[dns] - 10https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487) [18:30:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029125 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=428f84f9-4ca7-4d64-ba2f-941c3927... [18:30:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029126 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fea7df87-a776-4ad1-b5ea-1c4c47a6... [18:31:29] !log brennen@deploy1003 Finished scap: Backport for [[gerrit:1058211|Bump wikimedia/parsoid to 0.20.0-a16 (T371376 T371126)]] (duration: 08m 54s) [18:31:35] T371376: Linter related error on PCS tests: Cannot use object of type stdClass as array - https://phabricator.wikimedia.org/T371376 [18:31:37] T371126: CTT tasks week of 2024-07-26 - https://phabricator.wikimedia.org/T371126 [18:32:12] FIRING: [7x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:27] RESOLVED: [6x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:44] brennen: arlo tested that the parsoid backport works on testwiki, so we're good. 
[18:33:11] cscott: thanks, going ahead to group0 [18:33:50] !log 1.43.0-wmf.16 train (T366961): blockers resolved, rolling to group0 [18:33:54] cscott: https://blog.irccloud.com/slack-integration/ [18:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:56] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:34:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:17] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058234 (https://phabricator.wikimedia.org/T366961) [18:34:21] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058234 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:35:23] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058234 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:36:52] (03CR) 10JHathaway: [C:03+1] IDP: Switch to CAS 7.0 hosts. 
[dns] - 10https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [18:39:20] !log re-enabling BGP to lvs2012 from lsw1-b2-codfw T370862 [18:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:25] T370862: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862 [18:44:23] (03PS2) 10Ryan Kemper: wdqs: allow internal federation btw main&scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1057878 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [18:45:24] i'm getting a failure here for check_testservers_baremetal [18:45:53] 3 (of 130) requests with failed assertions [18:46:54] https://phabricator.wikimedia.org/P67084 [18:47:16] (03CR) 10Ryan Kemper: [C:03+2] wdqs: allow internal federation btw main&scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1057878 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [18:51:02] (03PS7) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [18:51:17] seems to have been transient; no errors on a repeat check. [18:51:51] (03PS8) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [18:53:18] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.16 refs T366961 [18:53:23] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:55:20] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10029199 (10Dwisehaupt) 05Open→03Resolved Built out pay-lb2001 and pay-lb2002. Going to close this task as all machines are in working order. Any follow on work and repla... 
[18:56:19] (03PS9) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [18:59:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10029209 (10cmooney) Work on this one is completed, all that remains is to remove the old cross-rack links which ar... [19:00:16] (03PS10) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:00:23] (03CR) 10TheDJ: "i have no idea how to do that, so if anyone else wants to, please go ahead." [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [19:01:55] (03PS11) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:04:19] (03CR) 10Ryan Kemper: wdqs: add main and scholarly role assignments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:04:25] 10ops-eqiad, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422 (10RobH) 03NEW p:05Triage→03Medium [19:04:26] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:04:26] 10ops-codfw, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423 (10RobH) 03NEW p:05Triage→03Medium [19:11:50] (03PS2) 10Cathal Mooney: lvs2011: move row C & D vlans to primary uplink and add new ones [puppet] - 
10https://gerrit.wikimedia.org/r/1056521 (https://phabricator.wikimedia.org/T370891) [19:13:35] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: 0.3.145 [19:19:17] web.libera.chat keeps logging me off :( i'm still around for the config deploy if that's happening. sorry i've lost some context. [19:21:35] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: 0.3.145 (duration: 07m 59s) [19:28:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lsw1-a2-codfw.mgmt with reason: reconfigure vlans on lvs2011 [19:28:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lsw1-a2-codfw.mgmt with reason: reconfigure vlans on lvs2011 [19:28:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029484 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdb9ae19-db19-42c1-a837-d30eff23... [19:29:03] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2011.codfw.wmnet with reason: reconfigure vlans on lvs2011 [19:29:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2011.codfw.wmnet with reason: reconfigure vlans on lvs2011 [19:29:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029498 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0cfea209-8c6a-4d44-8fbf-96f5cd79... 
[19:29:45] !log disable BGP to lvs2011 on lsw1-a2-codfw (moves traffic to lvs2014) in advance of vlan change T370891 [19:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:57] T370891: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891 [19:30:25] FIRING: SystemdUnitFailed: apache2.service on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:11] (03Abandoned) 10Kgraessle: When user is reverted by Automoderator, send them a talk page message - one last non primitive data type left [extensions/AutoModerator] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056214 (https://phabricator.wikimedia.org/T355930) (owner: 10Kgraessle) [19:32:48] (oh i'm still 8 minutes early for the backport window) [19:34:23] it's still train time :) [19:43:34] (03PS12) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:43:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10029621 (10RobH) [19:43:42] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [19:44:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10029625 (10RobH) [19:45:53] (03CR) 10Dzahn: "the new queries take quite some time to run, i'd like to find out what is considered too long first" [puppet] - 10https://gerrit.wikimedia.org/r/1056992 (https://phabricator.wikimedia.org/T370947) (owner: 10Aklapper) [19:49:22] FIRING: 
SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:44] (03CR) 10Cathal Mooney: [C:03+2] lvs2011: move row C & D vlans to primary uplink and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056521 (https://phabricator.wikimedia.org/T370891) (owner: 10Cathal Mooney) [19:55:46] (03PS2) 10Dzahn: phabricator: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) [19:58:40] !log rebooting lvs2011 to force new network config T370891 [19:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:52] T370891: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891 [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240730T2000). [20:00:04] cjming, ebernhardson, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hi - i can deploy [20:00:34] cjming: sounds good. can you ping me once done? [20:00:44] urbanecm: sure thing! [20:01:07] ebernhardson: are you around? i was thinking to do mine last [20:01:23] cscott: are you around? 
[20:01:51] i guess maybe i'll do mine first then [20:01:55] cjming: yup [20:02:09] oh - cool [20:02:17] (03PS3) 10Ebernhardson: Add NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058167 (https://phabricator.wikimedia.org/T355267) [20:02:22] i'm around [20:02:27] cjming: i'm not sure if mine triggers l10n or not? I know it has something to do with ensuring those get built properly but not sure if deploying it causes the rebuild [20:02:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058167 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:03:12] (03CR) 10Dzahn: [C:03+2] Include tags and subscibers in quarterly Phabricator data for WMF QLS [puppet] - 10https://gerrit.wikimedia.org/r/1056992 (https://phabricator.wikimedia.org/T370947) (owner: 10Aklapper) [20:03:27] (03Merged) 10jenkins-bot: Add NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058167 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:03:46] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1058167|Add NetworkSession extension (T355267)]] [20:03:51] ebernhardson: i think so? 
i just went thru this with our MP extension - had to let it propagate for 2 weeks [20:03:53] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [20:04:35] ebernhardson: so i'll just sync when the time comes [20:05:37] cjming: alrighty, i suppose i only mention because l10n rebuilds can take awhile [20:06:14] !log re-enable BGP to lvs2011 on lsw1-a2-codfw (restores as primary for traffic) T370891 [20:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:20] T370891: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891 [20:07:27] (03CR) 10Dzahn: [C:03+2] admin: add Kwaku as approver for the dns-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [20:07:30] hmm - i guess we'll see - i see messages saying 500+ languages have been rebuilt in the push to test servers [20:08:00] (03CR) 10Dzahn: [C:03+2] "thanks Kwaku" [puppet] - 10https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [20:08:29] cjming: yes, modifications of extension-list do cause an i18n rebuild [20:08:50] (03CR) 10Dzahn: [C:03+2] "query run time still considered acceptable. you should have a test mail, Andre" [puppet] - 10https://gerrit.wikimedia.org/r/1056992 (https://phabricator.wikimedia.org/T370947) (owner: 10Aklapper) [20:09:07] gtk! [20:09:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10029715 (10cmooney) Work completed on this one on the network & LVS side. @papaul we can now remove the cross-rac... [20:09:26] (we might be waiting for quite some time by now) [20:09:39] oh - any idea about time? 
[20:10:53] cjming: half an hour or possibly even more [20:11:08] 😬 [20:12:16] sorry :( i should have brought it up earlier. It could have been last [20:13:31] no worries from my side - live/learn -- sorry to cscott tho [20:16:31] !log bounce benthos@webrequest_live.service on centrallog for excessive lag [20:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:32] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#10029741 (10jhathaway) >>! In T367399#10029018, @hashar wrote: > What I am wondering is: i... [20:23:31] (03PS5) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [20:27:20] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434 (10RobH) 03NEW [20:27:22] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435 (10RobH) 03NEW [20:27:32] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10029796 (10RobH) [20:27:45] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10029801 (10RobH) [20:31:29] (03PS3) 10Cathal Mooney: lvs2013: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) [20:33:15] (03CR) 10Dzahn: [V:04-1] "ahh.. still an issue. 
this is the part I mentioned when we have to set "srange" in Hiera: see https://puppet-compiler.wmflabs.org/output/" [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:33:24] (03Abandoned) 10Cathal Mooney: Adjust LVS config in esams, drmrs to peer bit both ASWs [puppet] - 10https://gerrit.wikimedia.org/r/1020844 (https://phabricator.wikimedia.org/T362772) (owner: 10Cathal Mooney) [20:38:07] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1058167|Add NetworkSession extension (T355267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:12] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [20:38:28] ebernhardson: i'm going to sync [20:39:14] (03PS3) 10Dzahn: phabricator: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) [20:40:00] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [20:43:02] woo, got there eventually :) [20:45:25] RESOLVED: SystemdUnitFailed: apache2.service on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:55] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1058167|Add NetworkSession extension (T355267)]] (duration: 45m 08s) [20:48:59] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [20:49:11] nice - finally finished! [20:49:25] (03PS4) 10Dzahn: phabricator: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) [20:49:25] cscott: are you still around? sorry for the wait [20:49:55] (03PS3) 10C. 
Scott Ananian: Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) [20:51:42] yup i'm here! [20:51:50] cool [20:51:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian) [20:52:22] (03CR) 10Dzahn: [V:03+1 C:03+1] "now: https://puppet-compiler.wmflabs.org/output/1056006/3450/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:52:35] (03Merged) 10jenkins-bot: Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian) [20:52:53] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057976|Enable Parsoid Read Views on {en,he}wikivoyage (T365367)]] [20:52:58] T365367: [EPIC] Deploy Parsoid Read Views for English Wikivoyage and Hebrew Wikivoyage - https://phabricator.wikimedia.org/T365367 [20:53:27] (03CR) 10Dzahn: [C:03+2] phabricator: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:56:58] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#10029936 (10Dzahn) [20:58:36] !log cjming@deploy1003 cjming, cscott: Backport for [[gerrit:1057976|Enable Parsoid Read Views on {en,he}wikivoyage (T365367)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:58:47] ok, testing! 
[20:58:49] T365367: [EPIC] Deploy Parsoid Read Views for English Wikivoyage and Hebrew Wikivoyage - https://phabricator.wikimedia.org/T365367 [20:59:03] cscott: thanks - i'll sync on your word [21:00:36] (03CR) 10Dzahn: "noop confirmed - carefully self-merged and deployed first only on phab2002 and then on phab1004. the only thing that actually happened was" [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:01:19] cjming: yep, works on mwdebug1001.eqiad ! [21:01:24] yay! [21:01:28] !log cjming@deploy1003 cjming, cscott: Continuing with sync [21:06:11] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057976|Enable Parsoid Read Views on {en,he}wikivoyage (T365367)]] (duration: 13m 18s) [21:06:16] T365367: [EPIC] Deploy Parsoid Read Views for English Wikivoyage and Hebrew Wikivoyage - https://phabricator.wikimedia.org/T365367 [21:06:22] cscott: should be live! [21:06:29] testing it now! [21:06:44] urbanecm: do i have time to do one more? no big deal if not - i can reschedule for tomorrow [21:07:02] cjming: i can wait a bit more, no problem. [21:07:10] if you want me to deploy something, can do. up2you. [21:07:23] should be quick - thanks! [21:07:36] (03PS23) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [21:07:56] cjming: looks great, thanks so much! [21:08:06] yw! glad to hear it :) [21:09:14] urbanecm: Q for you -- https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1046732 << i'm getting: ```Change '1056062', project 'operations/puppet', branch 'production' not found in any deployed wikiversion. Deployed wikiversions: ['1.43.0-wmf.15', '1.43.0-wmf.16']``` is this ok? 
[21:09:46] cjming: yeah, that's because of the depends-on headers [21:09:54] i think so - dependent patches -- right on [21:09:56] the puppet patch is merged&deployed already [21:10:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [21:10:22] if the patch scap warns about was unmerged, then that would be a fair question to ask to the requesting person [21:10:25] whether that's expected or not [21:10:33] but in this case, it's merged, so not a reason to worry. [21:10:43] cool - thanks [21:11:03] (03Merged) 10jenkins-bot: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [21:11:19] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1046732|Deploy MetricsPlatform to beta cluster (T366234)]] [21:11:24] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234 [21:14:51] !log cjming@deploy1003 cjming: Backport for [[gerrit:1046732|Deploy MetricsPlatform to beta cluster (T366234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:16:16] (03CR) 10Máté Szabó: [C:03+1] Grant checkuser-temporary-account-no-preference to suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058143 (https://phabricator.wikimedia.org/T371364) (owner: 10Dreamy Jazz) [21:18:25] !log cjming@deploy1003 cjming: Continuing with sync [21:23:01] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1046732|Deploy MetricsPlatform to beta cluster (T366234)]] (duration: 11m 41s) [21:23:06] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234 [21:23:43] urbanecm: all yours! [21:23:50] thanks! 
[21:24:41] (03CR) 10Bking: [C:03+2] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [21:27:51] testing at mwdebug [21:34:18] (03PS1) 10Urbanecm: Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058258 (https://phabricator.wikimedia.org/T263870) [21:34:33] (03PS1) 10Urbanecm: Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058259 (https://phabricator.wikimedia.org/T263870) [21:35:18] (03CR) 10Urbanecm: [C:03+2] Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058258 (https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:35:26] (03CR) 10Urbanecm: [C:03+2] Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058259 (https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:38:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/OAuth] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058258 (https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:38:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/OAuth] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058259 (https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:43:18] (03PS1) 10Dwisehaupt: icinga: Add frqueue2003 pay-lb2001 and pay-lb2002 [puppet] - 10https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) [21:45:17] (03Merged) 10jenkins-bot: Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058258 
(https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:45:19] (03Merged) 10jenkins-bot: Fix resource response to use JSON content type header [extensions/OAuth] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058259 (https://phabricator.wikimedia.org/T263870) (owner: 10Urbanecm) [21:45:42] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1058258|Fix resource response to use JSON content type header (T263870)]], [[gerrit:1058259|Fix resource response to use JSON content type header (T263870)]] [21:45:47] T263870: Content-type on OAuth 2.0 profile endpoint is text/html, should be application/json - https://phabricator.wikimedia.org/T263870 [21:47:25] (03PS1) 10Dwisehaupt: crm: Add another dir to www_admin grouped dirs [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) [21:48:08] (03CR) 10Dwisehaupt: "This can roll out at any point without coordination." [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [21:53:51] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1058258|Fix resource response to use JSON content type header (T263870)]], [[gerrit:1058259|Fix resource response to use JSON content type header (T263870)]] (duration: 08m 09s) [21:53:59] T263870: Content-type on OAuth 2.0 profile endpoint is text/html, should be application/json - https://phabricator.wikimedia.org/T263870 [21:55:52] (03PS1) 10Dzahn: gerrit: add parameter for lfs sync destination host name [puppet] - 10https://gerrit.wikimedia.org/r/1058264 (https://phabricator.wikimedia.org/T257741) [21:59:16] (03CR) 10CI reject: [V:04-1] gerrit: add parameter for lfs sync destination host name [puppet] - 10https://gerrit.wikimedia.org/r/1058264 (https://phabricator.wikimedia.org/T257741) (owner: 10Dzahn) [22:02:39] (03CR) 10JHathaway: [C:03+1] crm: Add another dir to www_admin grouped dirs [puppet] - 10https://gerrit.wikimedia.org/r/1058263 
(https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:03:04] (03CR) 10Dzahn: [C:03+2] crm: Add another dir to www_admin grouped dirs [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:09:56] (03CR) 10Dzahn: [C:03+2] "see inline comment. you can also use mkdir_p. that might make things easier." [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:11:59] (03CR) 10Dzahn: [C:03+2] crm: Add another dir to www_admin grouped dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:21:45] (03CR) 10Dwisehaupt: crm: Add another dir to www_admin grouped dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:21:58] (03PS2) 10Dzahn: gerrit: add parameter for lfs sync destination host name [puppet] - 10https://gerrit.wikimedia.org/r/1058264 (https://phabricator.wikimedia.org/T257741) [22:23:01] (03CR) 10Dzahn: [C:03+2] crm: Add another dir to www_admin grouped dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058263 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:28:23] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1241.eqiad.wmnet with OS bullseye [22:28:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bullseye [22:28:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1242.eqiad.wmnet with OS bullseye [22:28:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030177 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
wikikube-worker1241.eqiad.wmnet with OS bull... [22:28:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1243.eqiad.wmnet with OS bull... [22:28:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1242.eqiad.wmnet with OS bull... [22:32:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1245.eqiad.wmnet with OS bullseye [22:32:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030189 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1245.eqiad.wmnet with OS bull... 
[22:34:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [22:45:06] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage [22:45:12] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage [22:45:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage [22:47:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage [22:48:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030204 (10Jclark-ctr) [22:49:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage [22:50:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage [22:53:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage [22:56:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage [22:56:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030221 (10Jclark-ctr) [22:58:07] (03PS5) 10Hashar: phabricator: remove git.wikimedia.org vhost, rewrites and tests [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [23:01:04] (03CR) 10Dzahn: [V:03+1 C:03+2] 
"https://puppet-compiler.wmflabs.org/output/1006982/3455/" [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [23:04:22] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:05:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1244.mgmt.eqiad.wmnet with reboot policy FORCED [23:05:51] (03CR) 10Dzahn: [V:03+1 C:03+2] "deployed first on phab2002, then phab1004, apache site removed, service refreshed, see no issues" [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [23:06:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1244.eqiad.wmnet with OS bullseye [23:06:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:06:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1242.eqiad.wmnet with OS bullseye [23:06:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1244.eqiad.wmnet with OS bull... 
[23:06:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1242.eqiad.wmnet with OS bullseye... [23:06:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1246.eqiad.wmnet with OS bullseye [23:06:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1246.eqiad.wmnet with OS bull... [23:07:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:08:40] !log removing 2 files for legal compliance [23:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:09:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1243.eqiad.wmnet with OS bullseye [23:09:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1243.eqiad.wmnet with OS bullseye... 
[23:09:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1247.eqiad.wmnet with OS bullseye [23:09:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030235 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1247.eqiad.wmnet with OS bull... [23:11:33] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:12:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:12:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1245.eqiad.wmnet with OS bullseye [23:12:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1245.eqiad.wmnet with OS bullseye... 
[23:13:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [23:14:10] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:14:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1249.mgmt.eqiad.wmnet with reboot policy FORCED [23:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - ps1-b4-eqiad - https://phabricator.wikimedia.org/T371100#10030246 (10Dzahn) [23:15:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1249.eqiad.wmnet with OS bullseye [23:15:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030250 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1249.eqiad.wmnet with OS bull... [23:15:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:15:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1241.eqiad.wmnet with OS bullseye [23:15:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030251 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1241.eqiad.wmnet with OS bullseye... 
[23:19:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030258 (10Jclark-ctr) [23:22:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [23:22:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1244.eqiad.wmnet with reason: host reimage [23:23:20] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage [23:25:39] (03PS1) 10Daimona Eaytoy: beta: Enable invitation lists for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058278 (https://phabricator.wikimedia.org/T370938) [23:25:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1244.eqiad.wmnet with reason: host reimage [23:25:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage [23:26:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1248.eqiad.wmnet with OS bullseye [23:26:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1248.eqiad.wmnet with OS bull... 
[23:26:37] !log removing 1 file for legal compliance [23:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058278 (https://phabricator.wikimedia.org/T370938) (owner: 10Daimona Eaytoy) [23:28:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage [23:31:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage [23:32:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason: host reimage [23:34:55] !log removing 1 file for legal compliance [23:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason: host reimage [23:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058280 [23:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058280 (owner: 10TrainBranchBot) [23:42:46] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage [23:43:18] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:44:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:44:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1244.eqiad.wmnet with OS bullseye [23:44:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1244.eqiad.wmnet with OS bullseye... [23:45:09] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:45:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage [23:45:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030283 (10Jclark-ctr) [23:46:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:46:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1246.eqiad.wmnet with OS bullseye [23:46:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1246.eqiad.wmnet with OS bullseye... 
[23:46:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030286 (10Jclark-ctr) [23:48:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:49:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:49:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1247.eqiad.wmnet with OS bullseye [23:49:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1247.eqiad.wmnet with OS bullseye... 
[23:50:19] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host wikikube-worker1248.mgmt.eqiad.wmnet with reboot policy FORCED [23:52:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:53:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:53:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1249.eqiad.wmnet with OS bullseye [23:53:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1249.eqiad.wmnet with OS bullseye... [23:53:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030297 (10Jclark-ctr) [23:59:22] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed