[00:13:46] (PS1) Andrea Denisse: ssl: Delete dummy TLS key for the Grafana hosts [labs/private] - https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414)
[00:14:46] (CR) Andrea Denisse: "This patch is to be merged after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025860" [labs/private] - https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[00:18:29] (PS1) Jdlrobson: Use new configuration for wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147)
[00:22:48] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@b10376a]: (no justification provided)
[00:23:19] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@b10376a]: (no justification provided) (duration: 00m 31s)
[00:27:29] (PS3) Ssingh: magru: add lvs700[1-3] and related configuration [puppet] - https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722)
[00:29:34] (PS1) Andrea Denisse: logstash: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414)
[00:31:55] (CR) Ssingh: "rebased, no code change" [puppet] - https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh)
[00:32:01] (CR) Ssingh: [C:+2] magru: add lvs700[1-3] and related configuration [puppet] - https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: Ssingh)
[00:33:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7001.magru.wmnet with OS bullseye
[00:34:00] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760024 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host lvs7001.magru.wmnet with OS bullseye
[00:43:18] SRE, Infrastructure-Foundations, vm-requests: Site: (2) VMs for ncredir - https://phabricator.wikimedia.org/T363881 (BCornwall) NEW
[00:58:18] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage
[00:58:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:02:09] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage
[01:25:41] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002"
[01:26:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002"
[01:26:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7001.magru.wmnet with OS bullseye
[01:26:55] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760118 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host lvs7001.magru.wmnet with OS bullseye compl...
[01:37:36] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7002.magru.wmnet with OS bullseye
[01:37:49] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760148 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host lvs7002.magru.wmnet with OS bullseye
[01:39:27] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760151 (ssingh)
[01:45:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:05] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage
[02:05:26] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage
[02:10:56] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:29:02] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760193 (ssingh)
[02:29:59] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002"
[02:31:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002"
[02:31:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7002.magru.wmnet with OS bullseye
[02:31:07] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760197 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host lvs7002.magru.wmnet with OS bullseye compl...
[02:31:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:38:53] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:39] (PS4) Sohom Datta: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: Dreamrimmer)
[02:52:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (dbprov1006, ...), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[03:00:26] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:08] PROBLEM - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:25:46] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:31:00] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:01:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:10:28] PROBLEM - PyBal IPVS diff check on lvs7002 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[04:50:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:50:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2098.codfw.wmnet with reason: Maintenance
[04:53:21] (PS1) Marostegui: db1234: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025895 (https://phabricator.wikimedia.org/T363890)
[04:54:03] (CR) Marostegui: [C:+2] db1234: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025895 (https://phabricator.wikimedia.org/T363890) (owner: Marostegui)
[04:54:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1234.eqiad.wmnet with OS bookworm
[04:54:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2121.codfw.wmnet with reason: Maintenance
[04:55:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2121.codfw.wmnet with reason: Maintenance
[04:55:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T361627)', diff saved to https://phabricator.wikimedia.org/P61500 and previous config saved to /var/cache/conftool/dbconfig/20240501-045517-marostegui.json
[04:55:20] T361627: Create cuc_agent_id, cule_agent_id and
cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[04:56:14] (PS1) Marostegui: db1236: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025896
[04:56:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1236', diff saved to https://phabricator.wikimedia.org/P61501 and previous config saved to /var/cache/conftool/dbconfig/20240501-045624-marostegui.json
[04:57:15] (CR) Marostegui: [C:+2] db1236: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025896 (owner: Marostegui)
[04:57:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1236.eqiad.wmnet with OS bookworm
[05:00:32] ops-eqiad, SRE, DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9760278 (Marostegui) Thank you John!
[05:01:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:01:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T361627)', diff saved to https://phabricator.wikimedia.org/P61502 and previous config saved to /var/cache/conftool/dbconfig/20240501-050135-marostegui.json
[05:01:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:07:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage
[05:08:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Down with HW issues
[05:08:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Down with HW issues
[05:10:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage
[05:10:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage
[05:14:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage
[05:16:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P61503 and previous config saved to /var/cache/conftool/dbconfig/20240501-051642-marostegui.json
[05:18:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186 to clone db1234 T363890', diff saved to https://phabricator.wikimedia.org/P61504 and previous config saved to /var/cache/conftool/dbconfig/20240501-051848-marostegui.json
[05:18:51] T363890: Reimage and reclone db1234 - https://phabricator.wikimedia.org/T363890
[05:22:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:21] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1186.eqiad.wmnet onto db1234.eqiad.wmnet
[05:25:18] (PS1) Marostegui: Revert "db1236: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025765
[05:28:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61505 and previous config saved to /var/cache/conftool/dbconfig/20240501-052810-root.json
[05:28:43] (PS1) Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1025906
(https://phabricator.wikimedia.org/T363892)
[05:28:48] (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1025907 (https://phabricator.wikimedia.org/T363892)
[05:29:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1236.eqiad.wmnet with OS bookworm
[05:29:58] (CR) Marostegui: [C:+2] Revert "db1236: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1025765 (owner: Marostegui)
[05:31:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1234.eqiad.wmnet with OS bookworm
[05:31:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P61506 and previous config saved to /var/cache/conftool/dbconfig/20240501-053149-marostegui.json
[05:33:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 6 hosts with reason: Setting up T355285 T355424
[05:33:37] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285
[05:33:38] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424
[05:33:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 6 hosts with reason: Setting up T355285 T355424
[05:33:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on es[1035,1039-1040].eqiad.wmnet with reason: Setting up T355285 T355424
[05:34:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es[1035,1039-1040].eqiad.wmnet with reason: Setting up T355285 T355424
[05:38:42] (PS1) Marostegui: mariadb: Productionize codfw es7 servers [puppet] - https://gerrit.wikimedia.org/r/1025899 (https://phabricator.wikimedia.org/T355424)
[05:39:24] (CR) Marostegui: [C:+2] mariadb: Productionize codfw es7 servers [puppet] - https://gerrit.wikimedia.org/r/1025899 (https://phabricator.wikimedia.org/T355424) (owner: Marostegui)
[05:43:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61507 and previous config saved to /var/cache/conftool/dbconfig/20240501-054316-root.json
[05:46:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T361627)', diff saved to https://phabricator.wikimedia.org/P61508 and previous config saved to /var/cache/conftool/dbconfig/20240501-054657-marostegui.json
[05:47:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:47:00] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:47:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:47:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T361627)', diff saved to https://phabricator.wikimedia.org/P61509 and previous config saved to /var/cache/conftool/dbconfig/20240501-054720-marostegui.json
[05:51:04] (PS1) Marostegui: es2038: Make it es7 master [puppet] - https://gerrit.wikimedia.org/r/1025900 (https://phabricator.wikimedia.org/T355424)
[05:51:40] (CR) Marostegui: [C:+2] es2038: Make it es7 master [puppet] - https://gerrit.wikimedia.org/r/1025900 (https://phabricator.wikimedia.org/T355424) (owner: Marostegui)
[05:52:08] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:53:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T361627)', diff saved to https://phabricator.wikimedia.org/P61510 and previous config saved to
/var/cache/conftool/dbconfig/20240501-055353-marostegui.json
[05:53:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[05:56:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2166', diff saved to https://phabricator.wikimedia.org/P61511 and previous config saved to /var/cache/conftool/dbconfig/20240501-055657-root.json
[05:57:32] (PS1) Marostegui: db2166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025901
[05:58:01] (CR) Marostegui: [C:+2] db2166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1025901 (owner: Marostegui)
[05:58:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2166.codfw.wmnet with OS bookworm
[05:58:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61512 and previous config saved to /var/cache/conftool/dbconfig/20240501-055822-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T0600)
[06:01:26] (PS1) Marostegui: conftool: Add es7 as valid section [puppet] - https://gerrit.wikimedia.org/r/1025902 (https://phabricator.wikimedia.org/T355285)
[06:05:58] (CR) Marostegui: [C:+2] conftool: Add es7 as valid section [puppet] - https://gerrit.wikimedia.org/r/1025902 (https://phabricator.wikimedia.org/T355285) (owner: Marostegui)
[06:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P61513 and previous config saved to /var/cache/conftool/dbconfig/20240501-060900-marostegui.json
[06:13:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61514 and previous config saved to /var/cache/conftool/dbconfig/20240501-061327-root.json
[06:13:40] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:15:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2166.codfw.wmnet with reason: host reimage
[06:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:17:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: host reimage
[06:18:42] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms
[06:21:47] swfrench-wmf: does that alert about confd have something to do with the maintenance ^
[06:24:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P61515 and previous config saved to /var/cache/conftool/dbconfig/20240501-062407-marostegui.json
[06:28:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61516 and previous config saved to /var/cache/conftool/dbconfig/20240501-062833-root.json
[06:30:14] (PS1) Marostegui: Revert "db2166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026026
[06:32:50] (PS1) Marostegui: check_depooled: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1026022
[06:32:50] (CR) Marostegui: [C:+2] Revert "db2166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026026 (owner: Marostegui)
[06:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61517 and previous config saved to
/var/cache/conftool/dbconfig/20240501-063318-root.json
[06:33:35] (CR) Marostegui: [C:+2] check_depooled: Add es6 and es7 [software] - https://gerrit.wikimedia.org/r/1026022 (owner: Marostegui)
[06:33:47] (Abandoned) Marostegui: check_depooled: Add es6 [software] - https://gerrit.wikimedia.org/r/1025675 (owner: Marostegui)
[06:38:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2166.codfw.wmnet with OS bookworm
[06:39:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T361627)', diff saved to https://phabricator.wikimedia.org/P61518 and previous config saved to /var/cache/conftool/dbconfig/20240501-063919-marostegui.json
[06:39:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance
[06:39:23] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:39:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance
[06:39:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T361627)', diff saved to https://phabricator.wikimedia.org/P61519 and previous config saved to /var/cache/conftool/dbconfig/20240501-063942-marostegui.json
[06:43:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61520 and previous config saved to /var/cache/conftool/dbconfig/20240501-064339-root.json
[06:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T361627)', diff saved to https://phabricator.wikimedia.org/P61521 and previous config saved to /var/cache/conftool/dbconfig/20240501-064600-marostegui.json
[06:46:03] T361627: Create cuc_agent_id,
cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[06:48:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61522 and previous config saved to /var/cache/conftool/dbconfig/20240501-064824-root.json
[06:58:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61523 and previous config saved to /var/cache/conftool/dbconfig/20240501-065845-root.json
[07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P61524 and previous config saved to /var/cache/conftool/dbconfig/20240501-070108-marostegui.json
[07:02:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1186.eqiad.wmnet onto db1234.eqiad.wmnet
[07:03:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61525 and previous config saved to /var/cache/conftool/dbconfig/20240501-070330-root.json
[07:03:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:46] (PS1) Marostegui: Revert "db1234: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026027
[07:05:12] (CR) Marostegui: [C:+2] Revert "db1234: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1026027 (owner: Marostegui)
[07:06:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61526 and previous config saved to /var/cache/conftool/dbconfig/20240501-070603-root.json
[07:07:58] ops-eqiad, SRE, DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9760385 (Marostegui)
[07:16:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P61527 and previous config saved to /var/cache/conftool/dbconfig/20240501-071615-marostegui.json
[07:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61528 and previous config saved to /var/cache/conftool/dbconfig/20240501-071836-root.json
[07:21:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61529 and previous config saved to /var/cache/conftool/dbconfig/20240501-072110-root.json
[07:21:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61530 and previous config saved to /var/cache/conftool/dbconfig/20240501-072149-root.json
[07:31:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T361627)', diff saved to https://phabricator.wikimedia.org/P61531 and previous config saved to /var/cache/conftool/dbconfig/20240501-073123-marostegui.json
[07:31:26] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:31:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2159.codfw.wmnet
with reason: Maintenance
[07:31:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance
[07:31:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:31:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T361627)', diff saved to https://phabricator.wikimedia.org/P61532 and previous config saved to /var/cache/conftool/dbconfig/20240501-073201-marostegui.json
[07:33:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61533 and previous config saved to /var/cache/conftool/dbconfig/20240501-073342-root.json
[07:36:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61534 and previous config saved to /var/cache/conftool/dbconfig/20240501-073615-root.json
[07:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61535 and previous config saved to /var/cache/conftool/dbconfig/20240501-073655-root.json
[07:38:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T361627)', diff saved to https://phabricator.wikimedia.org/P61536 and previous config saved to /var/cache/conftool/dbconfig/20240501-073812-marostegui.json
[07:38:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[07:42:09] (PS1) Marostegui: installserver: Allowing formatting db1246 [puppet] - https://gerrit.wikimedia.org/r/1026083 (https://phabricator.wikimedia.org/T363119)
[07:44:04] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T363895 (LSobanski) NEW
[07:46:34] (CR) Marostegui: [C:+2] installserver: Allowing formatting db1246 [puppet] - https://gerrit.wikimedia.org/r/1026083 (https://phabricator.wikimedia.org/T363119) (owner: Marostegui)
[07:48:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61537 and previous config saved to /var/cache/conftool/dbconfig/20240501-074848-root.json
[07:48:55] ops-eqiad, SRE, DBA, Patch-For-Review: db1246 crashed - https://phabricator.wikimedia.org/T363119#9760444 (Marostegui) >>! In T363119#9760443, @gerritbot wrote: > Change #1026083 **merged** by Marostegui: > %%%[operations/puppet@production] installserver: Allowing formatting db1246%%% > https://g...
[07:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61538 and previous config saved to /var/cache/conftool/dbconfig/20240501-075124-root.json
[07:52:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61539 and previous config saved to /var/cache/conftool/dbconfig/20240501-075200-root.json
[07:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P61540 and previous config saved to /var/cache/conftool/dbconfig/20240501-075320-marostegui.json
[07:53:47] (PS1) Marostegui: db2164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026085
[07:55:59] (CR) Marostegui: [C:+2] db2164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1026085 (owner: Marostegui)
[07:56:14] !log
marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2164', diff saved to https://phabricator.wikimedia.org/P61541 and previous config saved to /var/cache/conftool/dbconfig/20240501-075614-root.json [07:59:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2164.codfw.wmnet with OS bookworm [08:00:04] jnuche and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T0800). [08:01:41] morning, most people in UTC/Europe times are off today, including SREs [08:01:59] I'm going to be extra cautious and delay the train deployment to PDT time so there are more people around just in case of an emergency [08:02:08] right now I'm thinking 08:00 PDT/15:00 UTC for the rollout [08:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61542 and previous config saved to /var/cache/conftool/dbconfig/20240501-080354-root.json [08:05:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:05:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61543 and previous config saved to /var/cache/conftool/dbconfig/20240501-080630-root.json [08:07:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61544 and previous config saved to /var/cache/conftool/dbconfig/20240501-080706-root.json [08:08:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P61545 and previous config saved to 
/var/cache/conftool/dbconfig/20240501-080827-marostegui.json [08:17:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [08:20:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [08:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61546 and previous config saved to /var/cache/conftool/dbconfig/20240501-082135-root.json [08:22:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61547 and previous config saved to /var/cache/conftool/dbconfig/20240501-082211-root.json [08:23:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T361627)', diff saved to https://phabricator.wikimedia.org/P61548 and previous config saved to /var/cache/conftool/dbconfig/20240501-082334-marostegui.json [08:23:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance [08:23:37] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:23:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance [08:23:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T361627)', diff saved to https://phabricator.wikimedia.org/P61549 and previous config saved to /var/cache/conftool/dbconfig/20240501-082357-marostegui.json [08:27:15] (03PS1) 10Marostegui: instances.yaml: Add es7 eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026089 (https://phabricator.wikimedia.org/T355285) [08:27:50] (03CR) 10Marostegui: 
[C:03+2] instances.yaml: Add es7 eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026089 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [08:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T361627)', diff saved to https://phabricator.wikimedia.org/P61550 and previous config saved to /var/cache/conftool/dbconfig/20240501-082928-marostegui.json [08:29:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:31:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Push es7 eqiad config T355285', diff saved to https://phabricator.wikimedia.org/P61551 and previous config saved to /var/cache/conftool/dbconfig/20240501-083120-marostegui.json [08:31:23] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [08:35:50] (03PS1) 10Marostegui: Revert "db2164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026029 [08:36:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61552 and previous config saved to /var/cache/conftool/dbconfig/20240501-083641-root.json [08:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61553 and previous config saved to /var/cache/conftool/dbconfig/20240501-083717-root.json [08:39:56] (03CR) 10Marostegui: [C:03+2] Revert "db2164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026029 (owner: 10Marostegui) [08:41:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61554 and previous config saved to /var/cache/conftool/dbconfig/20240501-084116-root.json [08:44:02] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host db2164.codfw.wmnet with OS bookworm [08:44:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P61555 and previous config saved to /var/cache/conftool/dbconfig/20240501-084436-marostegui.json [08:46:49] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9760549 (10eoghan) 05Open→03Resolved [08:47:41] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9760548 (10eoghan) Both hosts have now been reprovisioned with public IPs. Thanks @Arnoldokoth for taking care of lists1004! [08:52:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61556 and previous config saved to /var/cache/conftool/dbconfig/20240501-085223-root.json [08:56:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61557 and previous config saved to /var/cache/conftool/dbconfig/20240501-085622-root.json [08:59:37] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 36 probes of 802 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:59:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P61558 and previous config saved to /var/cache/conftool/dbconfig/20240501-085943-marostegui.json [09:01:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:03:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61559 and previous config saved to /var/cache/conftool/dbconfig/20240501-090303-arnaudb.json [09:04:39] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 10 probes of 802 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:09:28] (03PS1) 10Marostegui: instances.yaml: Add es7 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026092 (https://phabricator.wikimedia.org/T355424) [09:09:57] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es7 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026092 (https://phabricator.wikimedia.org/T355424) (owner: 10Marostegui) [09:11:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61560 and previous config saved to /var/cache/conftool/dbconfig/20240501-091128-root.json [09:13:06] (03PS1) 10Cathal Mooney: Add netmon group to allow SSH into MRs [homer/public] - 10https://gerrit.wikimedia.org/r/1026094 [09:13:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Push es7 codfw config T355424', diff saved to https://phabricator.wikimedia.org/P61561 and previous config saved to /var/cache/conftool/dbconfig/20240501-091352-marostegui.json [09:13:56] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:14:00] (03CR) 10Cathal Mooney: [C:03+2] Add netmon group to allow SSH into MRs [homer/public] - 10https://gerrit.wikimedia.org/r/1026094 (owner: 10Cathal Mooney) [09:14:19] (03CR) 10Btullis: [V:03+1 C:03+2] Update the DPE ceph cluster to reef [puppet] - 
10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [09:14:20] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: helmfile scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020860 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [09:14:43] (03Merged) 10jenkins-bot: Add netmon group to allow SSH into MRs [homer/public] - 10https://gerrit.wikimedia.org/r/1026094 (owner: 10Cathal Mooney) [09:14:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T361627)', diff saved to https://phabricator.wikimedia.org/P61562 and previous config saved to /var/cache/conftool/dbconfig/20240501-091451-marostegui.json [09:14:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:14:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:15:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:15:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T361627)', diff saved to https://phabricator.wikimedia.org/P61563 and previous config saved to /var/cache/conftool/dbconfig/20240501-091513-marostegui.json [09:15:27] (03Merged) 10jenkins-bot: mw-videoscaler: helmfile scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020860 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [09:18:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61564 and previous config saved to /var/cache/conftool/dbconfig/20240501-091809-arnaudb.json [09:19:37] (03PS1) 10Marostegui: es1035: Remove insetup [puppet] - 
10https://gerrit.wikimedia.org/r/1026095 [09:20:07] (03CR) 10Marostegui: [C:03+2] es1035: Remove insetup [puppet] - 10https://gerrit.wikimedia.org/r/1026095 (owner: 10Marostegui) [09:21:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T361627)', diff saved to https://phabricator.wikimedia.org/P61565 and previous config saved to /var/cache/conftool/dbconfig/20240501-092125-marostegui.json [09:21:30] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:22:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:22:38] !log withdrawing public prefix announcement to AS7195 to test backup in magru (T362421) [09:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] T362421: magru network setup - https://phabricator.wikimedia.org/T362421 [09:24:14] (03PS1) 10Marostegui: etcd.php: Add es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026096 (https://phabricator.wikimedia.org/T355285) [09:25:07] (03CR) 10Marostegui: [C:03+2] etcd.php: Add es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026096 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:25:54] (03Merged) 10jenkins-bot: etcd.php: Add es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026096 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61566 and previous config saved to /var/cache/conftool/dbconfig/20240501-092634-root.json [09:27:32] !log marostegui@deploy1002 
Started scap: Backport for [[gerrit:1026096|etcd.php: Add es7 (T355285 T355424)]] [09:27:36] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [09:27:36] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:30:18] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1026096|etcd.php: Add es7 (T355285 T355424)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:30:22] !log marostegui@deploy1002 marostegui: Continuing with sync [09:31:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:33:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 15%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61567 and previous config saved to /var/cache/conftool/dbconfig/20240501-093315-arnaudb.json [09:36:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P61568 and previous config saved to /var/cache/conftool/dbconfig/20240501-093635-marostegui.json [09:41:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61569 and previous config saved to /var/cache/conftool/dbconfig/20240501-094140-root.json [09:42:26] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1026096|etcd.php: Add es7 (T355285 T355424)]] (duration: 14m 53s) [09:42:29] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [09:42:30] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:48:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: post schema change repool', diff saved to 
https://phabricator.wikimedia.org/P61570 and previous config saved to /var/cache/conftool/dbconfig/20240501-094821-arnaudb.json [09:51:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P61571 and previous config saved to /var/cache/conftool/dbconfig/20240501-095142-marostegui.json [09:52:54] !log restarting routinator service on rpki1001 [09:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61572 and previous config saved to /var/cache/conftool/dbconfig/20240501-095646-root.json [09:58:42] (03PS1) 10Marostegui: db2163: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026098 [09:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2163', diff saved to https://phabricator.wikimedia.org/P61573 and previous config saved to /var/cache/conftool/dbconfig/20240501-095845-root.json [09:58:53] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:23] (03CR) 10Marostegui: [C:03+2] db2163: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026098 (owner: 10Marostegui) [09:59:58] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T363895#9760669 (10cmooney) p:05Triage→03Low These are direct peerings to Equinix themselves over their own exchange. We are waiting on them to complet... 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1000) [10:00:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2163.codfw.wmnet with OS bookworm [10:03:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61574 and previous config saved to /var/cache/conftool/dbconfig/20240501-100326-arnaudb.json [10:06:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T361627)', diff saved to https://phabricator.wikimedia.org/P61575 and previous config saved to /var/cache/conftool/dbconfig/20240501-100650-marostegui.json [10:06:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [10:06:57] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:07:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [10:08:56] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:11:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61576 and previous config saved to /var/cache/conftool/dbconfig/20240501-101151-root.json [10:12:04] (03PS1) 10Marostegui: Revert "db2163: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026040 [10:12:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2200.codfw.wmnet with reason: 
Maintenance [10:12:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [10:13:53] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:17:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [10:17:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [10:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T361627)', diff saved to https://phabricator.wikimedia.org/P61577 and previous config saved to /var/cache/conftool/dbconfig/20240501-101728-marostegui.json [10:17:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:17:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2163.codfw.wmnet with reason: host reimage [10:18:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61578 and previous config saved to /var/cache/conftool/dbconfig/20240501-101832-arnaudb.json [10:20:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: host 
reimage [10:22:07] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [10:22:13] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host lvs7003.magru.wmnet with OS bullseye [10:22:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:22:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T361627)', diff saved to https://phabricator.wikimedia.org/P61579 and previous config saved to /var/cache/conftool/dbconfig/20240501-102253-marostegui.json [10:22:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:27:41] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs7003.magru.wmnet'] [10:28:00] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs7003.magru.wmnet'] [10:29:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Down with HW issues [10:29:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Down with HW issues [10:30:01] (03PS2) 10Effie Mouzeli: memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) [10:30:40] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host lvs7003.magru.wmnet with OS bullseye [10:30:49] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host lvs7003.magru.wmnet with OS bullseye execu... [10:30:54] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [10:30:59] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host lvs7003.magru.wmnet with OS bullseye [10:33:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: post schema change repool', diff saved to https://phabricator.wikimedia.org/P61580 and previous config saved to /var/cache/conftool/dbconfig/20240501-103338-arnaudb.json [10:37:09] (03CR) 10Marostegui: [C:03+2] Revert "db2163: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026040 (owner: 10Marostegui) [10:37:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61581 and previous config saved to /var/cache/conftool/dbconfig/20240501-103758-root.json [10:38:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P61582 and previous config saved to /var/cache/conftool/dbconfig/20240501-103801-marostegui.json [10:42:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:42:35] !log marostegui@cumin1002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2163.codfw.wmnet with OS bookworm [10:42:43] (03PS7) 10MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) [10:47:11] FIRING: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:52:11] RESOLVED: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61583 and previous config saved to /var/cache/conftool/dbconfig/20240501-105304-root.json [10:53:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P61584 and previous config saved to /var/cache/conftool/dbconfig/20240501-105315-marostegui.json [10:55:29] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [10:55:35] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7001.magru.wmnet [10:58:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [11:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1100). 
[11:05:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7002.magru.wmnet [11:07:42] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs7001.magru.wmnet [11:08:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61585 and previous config saved to /var/cache/conftool/dbconfig/20240501-110809-root.json [11:08:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T361627)', diff saved to https://phabricator.wikimedia.org/P61586 and previous config saved to /var/cache/conftool/dbconfig/20240501-110822-marostegui.json [11:08:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance [11:08:25] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:08:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance [11:08:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T361627)', diff saved to https://phabricator.wikimedia.org/P61587 and previous config saved to /var/cache/conftool/dbconfig/20240501-110834-marostegui.json [11:09:03] (03PS3) 10Effie Mouzeli: memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) [11:12:17] PROBLEM - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:12:34] ^ that's fine [11:13:06] (03PS1) 10Btullis: Remove the cephadm role [puppet] - 10https://gerrit.wikimedia.org/r/1026100 (https://phabricator.wikimedia.org/T363559) [11:13:53] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2220 (T361627)', diff saved to https://phabricator.wikimedia.org/P61588 and previous config saved to /var/cache/conftool/dbconfig/20240501-111353-marostegui.json [11:13:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:15:44] (03PS4) 10Effie Mouzeli: memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) [11:17:21] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs7002.magru.wmnet [11:18:40] (03CR) 10CI reject: [V:04-1] memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) (owner: 10Effie Mouzeli) [11:19:12] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760772 (10ssingh) [11:21:57] PROBLEM - PyBal IPVS diff check on lvs7002 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:22:51] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [11:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61589 and previous config saved to /var/cache/conftool/dbconfig/20240501-112315-root.json [11:24:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [11:24:07] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7003.magru.wmnet with OS 
bullseye [11:24:18] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host lvs7003.magru.wmnet with OS bullseye compl... [11:26:58] (03PS5) 10Effie Mouzeli: memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) [11:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P61590 and previous config saved to /var/cache/conftool/dbconfig/20240501-112900-marostegui.json [11:38:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61591 and previous config saved to /var/cache/conftool/dbconfig/20240501-113821-root.json [11:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P61592 and previous config saved to /var/cache/conftool/dbconfig/20240501-114408-marostegui.json [11:51:48] (03PS8) 10MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) [11:52:10] (03CR) 10CI reject: [V:04-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [11:52:26] (03PS9) 10MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) [11:52:46] (03CR) 10CI reject: [V:04-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 
10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [11:53:25] (03PS10) 10MVernon: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) [11:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61593 and previous config saved to /var/cache/conftool/dbconfig/20240501-115327-root.json [11:59:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T361627)', diff saved to https://phabricator.wikimedia.org/P61594 and previous config saved to /var/cache/conftool/dbconfig/20240501-115915-marostegui.json [11:59:18] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:08:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61595 and previous config saved to /var/cache/conftool/dbconfig/20240501-120833-root.json [12:13:37] (03PS1) 10Marostegui: db2154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026104 [12:13:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2154', diff saved to https://phabricator.wikimedia.org/P61596 and previous config saved to /var/cache/conftool/dbconfig/20240501-121347-root.json [12:14:14] (03CR) 10Marostegui: [C:03+2] db2154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026104 (owner: 10Marostegui) [12:15:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance [12:15:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance 
[12:15:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bookworm [12:19:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:19:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:19:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:20:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:20:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T361627)', diff saved to https://phabricator.wikimedia.org/P61597 and previous config saved to /var/cache/conftool/dbconfig/20240501-122012-marostegui.json [12:20:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:22:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T361627)', diff saved to https://phabricator.wikimedia.org/P61598 and previous config saved to /var/cache/conftool/dbconfig/20240501-122224-marostegui.json [12:24:54] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bookworm [12:31:01] (03CR) 10Btullis: [C:03+2] Remove the cephadm role [puppet] - 10https://gerrit.wikimedia.org/r/1026100 (https://phabricator.wikimedia.org/T363559) (owner: 10Btullis) [12:32:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage [12:35:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2154.codfw.wmnet with reason: host reimage [12:37:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P61599 and previous config saved to /var/cache/conftool/dbconfig/20240501-123732-marostegui.json [12:42:04] (03PS1) 10Marostegui: Revert "db2154: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026126 [12:45:59] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [12:48:55] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [12:51:59] (03CR) 10Marostegui: [C:03+2] Revert "db2154: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026126 (owner: 10Marostegui) [12:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61600 and previous config saved to /var/cache/conftool/dbconfig/20240501-125158-root.json [12:52:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P61601 and previous config saved to /var/cache/conftool/dbconfig/20240501-125239-marostegui.json [12:55:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS bookworm [12:58:44] (03PS1) 10Phuedx: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:58:44] (03CR) 10Phuedx: "See inline for a question and a suggestion." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [12:59:18] PROBLEM - PyBal IPVS diff check on lvs7003 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1300) [13:00:05] hnowlan, Sohom_Datta, and DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:03] o/ [13:01:26] I am around [13:07:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61602 and previous config saved to /var/cache/conftool/dbconfig/20240501-130704-root.json [13:07:18] we might be a little short on deployers given that it's IWD [13:07:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T361627)', diff saved to https://phabricator.wikimedia.org/P61603 and previous config saved to /var/cache/conftool/dbconfig/20240501-130747-marostegui.json [13:07:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:07:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:08:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:08:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61604 and previous config 
saved to /var/cache/conftool/dbconfig/20240501-130822-marostegui.json [13:13:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61605 and previous config saved to /var/cache/conftool/dbconfig/20240501-131351-marostegui.json [13:13:59] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:15:47] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bookworm [13:18:42] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/1020191/2218/" [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) (owner: 10Effie Mouzeli) [13:18:44] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9760906 (10bking) [13:22:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61606 and previous config saved to /var/cache/conftool/dbconfig/20240501-132211-root.json [13:22:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:10] (03PS1) 10Ssingh: magru: depool geoip/text* [dns] - 10https://gerrit.wikimedia.org/r/1026119 (https://phabricator.wikimedia.org/T346722) [13:25:33] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9760910 (10bking) 05Open→03Resolved As of yesterday, the 
production Elastic clusters are using CFSSL, which means we've accomplished our... [13:25:38] (03CR) 10Ssingh: [C:03+2] magru: depool geoip/text* [dns] - 10https://gerrit.wikimedia.org/r/1026119 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:25:54] !log running authdns-update for CR 1026119: depool magru text* [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P61607 and previous config saved to /var/cache/conftool/dbconfig/20240501-132900-marostegui.json [13:29:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm [13:30:03] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9760937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm [13:33:09] !log promoting HNowlan (WMF) to admin in testwiki [13:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61608 and previous config saved to /var/cache/conftool/dbconfig/20240501-133717-root.json [13:37:24] marostegui: thanks for flagging the ConfdResourceFailed alerts. I took a quick look yesterday evening and they seemed to be related to magru turnup. [13:37:36] let me take another look [13:37:57] swfrench-wmf: Sure, I have no idea if they were related or not, but just better be safe than sorry :) [13:38:03] That's why I asked :) [13:39:14] yeah they are related sorry. 
I downtimed them but maybe will extend them [13:39:27] thanks sukhe [13:39:29] a cleanup is required but there might be more than one so I am just saving it till everything is setup (today) [13:41:36] thanks, sukhe, and yeah totally agreed about better being on the safe side, marostegui :) [13:41:55] yep [13:43:04] A bit late into the deploy window, but I'm around as well [13:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P61609 and previous config saved to /var/cache/conftool/dbconfig/20240501-134407-marostegui.json [13:45:16] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:45:37] it's curious that the confd logs are only complaining about magru, but there are alerts with labels for other targets [13:47:26] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bookworm [13:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61610 and previous config saved to /var/cache/conftool/dbconfig/20240501-135222-root.json [13:54:41] RECOVERY - Recursive DNS on 195.200.68.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:55:19] RECOVERY - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:55:52] swfrench-wmf: yeah, you are absolutely right. 
I just couldn't see at that time on what the errors are on the other sites but I thought I will clean up the state, restart confd, and then see if that helps [13:56:14] and if that doesn't help, we will dig deeper [13:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61611 and previous config saved to /var/cache/conftool/dbconfig/20240501-135915-marostegui.json [13:59:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:59:20] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:59:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1400) [14:01:24] (03CR) 10Bking: [C:03+2] rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025850 (https://phabricator.wikimedia.org/T362508) (owner: 10Bking) [14:03:01] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:03:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [14:03:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:03:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61612 and previous config 
saved to /var/cache/conftool/dbconfig/20240501-140333-marostegui.json [14:03:36] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:05:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61613 and previous config saved to /var/cache/conftool/dbconfig/20240501-140545-marostegui.json [14:05:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:05:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [14:07:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61614 and previous config saved to /var/cache/conftool/dbconfig/20240501-140728-root.json [14:08:03] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:10:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:11:13] (03PS2) 10Jdrewniak: Enable Vector appearance menu on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [14:11:54] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:12:00] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:13:35] PROBLEM - Recursive DNS on 2a02:ec80:700:1:195:200:68:5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:13:35] PROBLEM - Recursive DNS on 195.200.68.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out 
https://wikitech.wikimedia.org/wiki/DNS [14:13:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:16:36] ^ ok resolving this now, it's time [14:19:37] (03CR) 10JHathaway: "Glad to hear it was helpful, your recent changes look good. It is super confusing that file contents are in the catalog, but file source d" [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P61615 and previous config saved to /var/cache/conftool/dbconfig/20240501-142053-marostegui.json [14:21:23] sukhe: it turns out it's just an artifact of how our prom exporter script determines whether a given target is healthy: it checks whether there's a staged-but-uncommitted config file (say, because the check command failed, as is happening for magru right now) that's newer (mtime) than the live config. [14:21:45] in short, anything with a given basename (e.g., text-https) will trip all the others with that same basename. [14:21:50] interesting [14:22:00] /usr/local/bin/pybal-eval-check /srv/config-master/pybal/magru/.text151566512' with 1 (0.029816627502441406s) [invalid]: server pool cannot be empty!
[14:22:03] this is fine and expected though [14:22:33] but I didn't know about the same basename thing [14:22:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61616 and previous config saved to /var/cache/conftool/dbconfig/20240501-142233-root.json [14:22:38] not fine to /usr/local/bin/pybal-eval-check :) [14:23:44] (03CR) 10MVernon: "> Glad to hear it was helpful, your recent changes look good. It is super confusing that file contents are in the catalog, but file source" [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:23:56] in any case, everything that looks like /srv/config-master/pybal/$SITE/text-https (or any number of other filenames) will be reported as unhealthy [14:24:08] is that intended? [14:24:30] presumably not - I'll open a task in a bit [14:24:31] because in this case, clearly, magru is unhealthy but not others, though we are getting alerted for others [14:24:40] I see, I thought there is a reason that I don't know about re: confd [14:24:54] (03CR) 10JHathaway: [C:03+1] "yup!"
[puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:25:13] swfrench-wmf: thanks for looking into it [14:25:17] fortunately, not a confd problem per se - more a monitoring problem [14:25:25] (03CR) 10MVernon: [C:03+2] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:25:35] i'll open a task so it's at least written down and we can figure out if there's a better way to do this [14:25:43] at least for magru and this specific alert, we will clear it up when we pool servers so I will let it be like this for a bit more [14:26:34] sounds good [14:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P61617 and previous config saved to /var/cache/conftool/dbconfig/20240501-143601-marostegui.json [14:36:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bookworm [14:38:53] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:22] 10ops-codfw, 06SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T363756#9761122 (10Jhancock.wm) a:03Jhancock.wm fixed the main source of the alert (PSU and power cable reseated) but still getting the following error. Error Code PSU0049 Message Unable to power on the Power Supply Unit (PSU) %1 wi... 
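Note on the ConfdResourceFailed discussion (14:21:23 to 14:25:35): the following is a minimal sketch of how a basename-keyed health check could flag every site when only one site has a rejected staged file. The real exporter script does not appear in this log, so everything here is an assumption made for illustration: the function names, the temporary directory layout, and the staged-file naming (a dot, the basename, then digits, inferred from `.text151566512`). It contrasts a check keyed on basename with one keyed on the full path.

```python
import os
import re
import tempfile
from pathlib import Path

# Assumed staging convention: ".<basename><digits>" beside the live file,
# inferred from ".text151566512" in the log; not taken from the real script.
STAGED = re.compile(r"^\.(?P<base>.+?)\d+$")


def stale_targets(root):
    """Return the live paths that have a newer staged file beside them."""
    stale = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            match = STAGED.match(name)
            if not match:
                continue
            live = Path(dirpath) / match.group("base")
            staged = Path(dirpath) / name
            if live.exists() and staged.stat().st_mtime > live.stat().st_mtime:
                stale.add(live)
    return stale


def unhealthy_by_basename(root, targets):
    """Collision-prone variant: one stale file flags every same-named target."""
    stale_names = {path.name for path in stale_targets(root)}
    return {t for t in targets if t.name in stale_names}


def unhealthy_by_path(root, targets):
    """Per-site variant: only a target with a stale file in its own directory."""
    stale = stale_targets(root)
    return {t for t in targets if t in stale}


def demo():
    """Build two hypothetical sites; only magru has a rejected staged file."""
    root = Path(tempfile.mkdtemp())
    targets = []
    for site in ("codfw", "magru"):
        (root / site).mkdir()
        live = root / site / "text-https"
        live.write_text("live config\n")
        os.utime(live, (1000, 1000))  # fixed, old mtime
        targets.append(live)
    staged = root / "magru" / ".text-https123456"
    staged.write_text("rejected config\n")
    os.utime(staged, (2000, 2000))  # newer than the live file
    by_name = sorted(t.parent.name for t in unhealthy_by_basename(root, targets))
    by_path = sorted(t.parent.name for t in unhealthy_by_path(root, targets))
    return by_name, by_path


if __name__ == "__main__":
    print(demo())
```

In this sketch the basename-keyed check reports both codfw and magru, while the path-keyed check reports only magru. That matches the symptom described in the channel, though the actual fix may look different once the task is filed.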
[14:42:19] !log dancy@deploy1002 Installing scap version "4.81.0" for 325 hosts [14:42:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:43:08] !log dancy@deploy1002 Installation of scap version "4.81.0" completed for 325 hosts [14:47:16] (03PS1) 10JHathaway: cephadm: confine fact to ceph nodes [puppet] - 10https://gerrit.wikimedia.org/r/1026156 (https://phabricator.wikimedia.org/T279621) [14:49:57] 10ops-codfw, 06SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T363756#9761138 (10Jhancock.wm) 05Open→03Resolved removed the error by rebooting the idrac [14:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61618 and previous config saved to /var/cache/conftool/dbconfig/20240501-145108-marostegui.json [14:51:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:51:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:51:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:51:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T361627)', diff saved to https://phabricator.wikimedia.org/P61619 and previous config saved to /var/cache/conftool/dbconfig/20240501-145131-marostegui.json [14:52:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T361627)', diff saved to https://phabricator.wikimedia.org/P61620 and previous config saved to 
/var/cache/conftool/dbconfig/20240501-145243-marostegui.json [14:52:49] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#9761151 (10bking) @MoritzMuehlenhoff Thanks for the feedback, you've given me some food for thought. Here are my thoughts: - Like etcd, Opens... [14:53:30] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363847#9761155 (10Jhancock.wm) 05Open→03Declined see T362938 [14:53:34] (03CR) 10Ssingh: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1026156 (https://phabricator.wikimedia.org/T279621) (owner: 10JHathaway) [14:53:52] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363838#9761160 (10Jhancock.wm) 05Open→03Declined see T362938 [14:53:53] (03CR) 10JHathaway: [C:03+2] cephadm: confine fact to ceph nodes [puppet] - 10https://gerrit.wikimedia.org/r/1026156 (https://phabricator.wikimedia.org/T279621) (owner: 10JHathaway) [14:57:08] (03PS9) 10TChin: Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [14:58:07] (03PS1) 10EoghanGaffney: lists: Add collaboration services as owner [puppet] - 10https://gerrit.wikimedia.org/r/1026157 (https://phabricator.wikimedia.org/T331706) [14:58:43] (03PS1) 10Hnowlan: k8s: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1026158 (https://phabricator.wikimedia.org/T36232) [14:58:53] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:19] jouncebot: nowandnext [15:00:19] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [15:00:19] In 1 hour(s) 
and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1700) [15:00:35] hi, I'm going to deploy the train to group1 in the next few minutes [15:00:38] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1026157 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:00:45] (03PS1) 10Hnowlan: trafficserver: move 80% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1026159 (https://phabricator.wikimedia.org/T362323) [15:02:17] (03PS1) 10Hnowlan: mw-we, mw-api-ext: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026160 (https://phabricator.wikimedia.org/T362323) [15:02:48] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:04:21] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026161 (https://phabricator.wikimedia.org/T361397) [15:04:23] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026161 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [15:05:07] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026161 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [15:07:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P61621 and previous config saved to /var/cache/conftool/dbconfig/20240501-150751-marostegui.json [15:13:25] (03PS2) 10Hnowlan: k8s: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1026158 
(https://phabricator.wikimedia.org/T362323) [15:15:13] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7001.wikimedia.org with OS bookworm [15:15:18] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm exe... [15:15:23] PROBLEM - Check whether ferm is active by checking the default input chain on mw2406 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:22:36] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.3 refs T361397 [15:22:40] T361397: 1.43.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T361397 [15:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P61622 and previous config saved to /var/cache/conftool/dbconfig/20240501-152259-marostegui.json [15:38:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T361627)', diff saved to https://phabricator.wikimedia.org/P61623 and previous config saved to /var/cache/conftool/dbconfig/20240501-153806-marostegui.json [15:38:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:38:10] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:38:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:38:30] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Depooling db1194 (T361627)', diff saved to https://phabricator.wikimedia.org/P61624 and previous config saved to /var/cache/conftool/dbconfig/20240501-153829-marostegui.json [15:39:47] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bookworm [15:40:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T361627)', diff saved to https://phabricator.wikimedia.org/P61625 and previous config saved to /var/cache/conftool/dbconfig/20240501-154042-marostegui.json [15:45:23] RECOVERY - Check whether ferm is active by checking the default input chain on mw2406 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:52:34] (03PS3) 10Jdrewniak: [Vector] Enable appearance menu and increased font-size on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [15:55:00] (03PS4) 10Jdrewniak: [Vector] Enable appearance menu and increased font-size on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [15:55:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P61626 and previous config saved to /var/cache/conftool/dbconfig/20240501-155552-marostegui.json [15:59:44] (03PS2) 10Hnowlan: mw-web, mw-api-ext: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026160 (https://phabricator.wikimedia.org/T362323) [16:00:45] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@09b4f5f]: Testing different settings for mediawiki_history_shapshot_config [16:01:13] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@09b4f5f]: Testing different settings for mediawiki_history_shapshot_config (duration: 00m 28s) [16:01:21] PROBLEM - Check unit status 
of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:02:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:08] (03CR) 10Scott French: [C:03+1] k8s: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1026158 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [16:05:57] (03CR) 10Scott French: [C:03+1] mw-web, mw-api-ext: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026160 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [16:06:19] PROBLEM - Auth DNS on dns7001 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [16:06:23] PROBLEM - AuthDNS-over-TLS Works on dns7001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [16:07:04] ^ expected [16:07:10] (03CR) 10Scott French: [C:03+1] "Commit message nit: 80% -> 85%" [puppet] - 10https://gerrit.wikimedia.org/r/1026159 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [16:07:12] I don't want to silence this for now [16:10:31] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cephosd1003.eqiad.wmnet with OS bookworm [16:11:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P61627 and previous config saved to /var/cache/conftool/dbconfig/20240501-161059-marostegui.json [16:11:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bookworm [16:15:14] (03PS1) 10Ssingh: P:dns::auth::update: add
onlyif on authdns-loca-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 [16:16:30] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2222/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026166 (owner: 10Ssingh) [16:16:45] (03PS2) 10Ssingh: P:dns::auth::update: add onlyif on authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 [16:26:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T361627)', diff saved to https://phabricator.wikimedia.org/P61628 and previous config saved to /var/cache/conftool/dbconfig/20240501-162607-marostegui.json [16:26:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:26:10] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:26:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:26:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T361627)', diff saved to https://phabricator.wikimedia.org/P61629 and previous config saved to /var/cache/conftool/dbconfig/20240501-162629-marostegui.json [16:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T361627)', diff saved to https://phabricator.wikimedia.org/P61630 and previous config saved to /var/cache/conftool/dbconfig/20240501-162942-marostegui.json [16:31:26] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [16:34:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [16:38:11] 
FIRING: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:39:15] (03PS3) 10Ssingh: P:dns::auth::update: add unless on authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 [16:40:19] (03PS4) 10Ssingh: P:dns::auth::update: add unless on authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 [16:41:12] (03CR) 10BBlack: [C:03+1] P:dns::auth::update: add unless on authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 (owner: 10Ssingh) [16:41:25] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2224/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026166 (owner: 10Ssingh) [16:43:11] RESOLVED: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:43:49] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1026166'" [16:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:12] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [16:44:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P61632 and previous config saved to /var/cache/conftool/dbconfig/20240501-164450-marostegui.json [16:44:54] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::update: add unless on authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/1026166 (owner: 10Ssingh) [16:48:31] 10ops-eqiad, 
06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9761520 (10VRiley-WMF) Hey @andrea.denisse we will schedule it once we have figured out a solution for wiping these drives as it seems like this has been a problem cropping up recently. [16:48:35] (03PS1) 10Ssingh: P:dns::update: specify /usr/bin/test instead of test [puppet] - 10https://gerrit.wikimedia.org/r/1026169 [16:51:54] (03CR) 10Ssingh: [C:03+2] P:dns::update: specify /usr/bin/test instead of test [puppet] - 10https://gerrit.wikimedia.org/r/1026169 (owner: 10Ssingh) [16:56:49] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:59:29] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [16:59:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bookworm [16:59:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P61633 and previous config saved to /var/cache/conftool/dbconfig/20240501-165957-marostegui.json [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1700) [17:01:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:01:55] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 23 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:02:17] (03CR) 10JHathaway: "volans still interested in testing this?" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway) [17:02:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:02:39] !log sudo cumin -b1 -s10 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1026166'" [17:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:06] 06SRE, 06SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924 (10Scott_French) 03NEW [17:12:34] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm [17:12:43] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761576 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm [17:14:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bookworm [17:15:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T361627)', diff saved to https://phabricator.wikimedia.org/P61634 and previous config saved to /var/cache/conftool/dbconfig/20240501-171504-marostegui.json [17:15:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [17:15:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 
[17:15:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [17:15:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T361627)', diff saved to https://phabricator.wikimedia.org/P61635 and previous config saved to /var/cache/conftool/dbconfig/20240501-171527-marostegui.json [17:16:14] (03PS5) 10Jdrewniak: [Vector] Enable appearance menu and increased font-size on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [17:20:11] PROBLEM - Host 2a02:ec80:700:1:195:200:68:5 is DOWN: CRITICAL - Host Unreachable (2a02:ec80:700:1:195:200:68:5) [17:21:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T361627)', diff saved to https://phabricator.wikimedia.org/P61636 and previous config saved to /var/cache/conftool/dbconfig/20240501-172059-marostegui.json [17:21:03] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:28:09] 2a02:ec80:700:1:195:200:68:5 is fine and expected [17:28:30] this is dns7001, not pooled for anything [17:35:51] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [17:36:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P61637 and previous config saved to /var/cache/conftool/dbconfig/20240501-173607-marostegui.json [17:38:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [17:43:46] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T363926 (10phaultfinder) 03NEW [17:46:25] !log sukhe@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [17:49:08] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [17:49:15] (03CR) 10Stoyofuku-wmf: [Vector] Enable appearance menu and increased font-size on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [17:50:32] (03CR) 10Stoyofuku-wmf: [Vector] Enable appearance menu and increased font-size on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [17:51:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P61638 and previous config saved to /var/cache/conftool/dbconfig/20240501-175114-marostegui.json [17:53:50] PROBLEM - Recursive DNS on 195.200.68.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:59:50] PROBLEM - Recursive DNS on 2a02:ec80:700:1:195:200:68:5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:00:04] jnuche and brennen: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1800). [18:00:05] jnuche and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1800). nyaa~ [18:03:17] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bookworm [18:04:05] (03CR) 10Stoyofuku-wmf: [C:03+1] "Looks good, thank you! 
Left a small comment, but it's optional" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T361627)', diff saved to https://phabricator.wikimedia.org/P61639 and previous config saved to /var/cache/conftool/dbconfig/20240501-180622-marostegui.json [18:06:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance [18:06:25] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:06:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance [18:06:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T361627)', diff saved to https://phabricator.wikimedia.org/P61640 and previous config saved to /var/cache/conftool/dbconfig/20240501-180645-marostegui.json [18:09:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T361627)', diff saved to https://phabricator.wikimedia.org/P61641 and previous config saved to /var/cache/conftool/dbconfig/20240501-180958-marostegui.json [18:11:48] RECOVERY - Recursive DNS on 195.200.68.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:11:48] RECOVERY - Recursive DNS on 2a02:ec80:700:1:195:200:68:5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:14:15] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [18:15:23] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [18:15:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns7001.wikimedia.org with OS bookworm [18:15:33] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm com... [18:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:16:21] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [18:16:27] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761747 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm [18:19:14] PROBLEM - Host 2a02:ec80:700:2:195:200:68:37 is DOWN: CRITICAL - Host Unreachable (2a02:ec80:700:2:195:200:68:37) [18:19:24] ^ that's fine, reimaging, not pooled [18:19:26] PROBLEM - Host 195.200.68.37 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:30] ^ this as well [18:24:20] RECOVERY - Host 195.200.68.37 is UP: PING OK - Packet loss = 0%, RTA = 115.15 ms [18:25:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P61642 and previous config saved to /var/cache/conftool/dbconfig/20240501-182505-marostegui.json [18:26:24] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS 
[18:28:29] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [18:31:03] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761762 (10ssingh) [18:31:54] (03CR) 10Jdlrobson: [C:03+1] [Vector] Enable appearance menu and increased font-size on testwiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:35:21] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7002.magru.wmnet'] [18:35:36] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7002.magru.wmnet'] [18:35:42] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7002.magru.wmnet'] [18:36:24] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7002.magru.wmnet'] [18:36:26] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7002.wikimedia.org with OS bookworm [18:36:35] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm exe... 
[18:36:39] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [18:36:43] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9761774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm [18:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P61643 and previous config saved to /var/cache/conftool/dbconfig/20240501-184013-marostegui.json [18:40:25] (03PS6) 10Jdrewniak: [Vector] Enable appearance menu and increased font-size on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:42:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:46:34] (03CR) 10Stoyofuku-wmf: [Vector] Enable appearance menu and increased font-size on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:47:36] (03CR) 10Jdrewniak: [Vector] Enable appearance menu and increased font-size on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:51:44] (03CR) 10JHathaway: [C:03+2] postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [18:52:00] (03CR) 10JHathaway: [C:03+2] postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407) (owner: 
10JHathaway) [18:52:09] (03CR) 10JHathaway: [C:03+2] postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407) (owner: 10JHathaway) [18:52:22] (03CR) 10Stoyofuku-wmf: [C:03+1] "Looks good, and has the added benefit of being equivalent to what we have in `CommonSettings-labs.php` for beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [18:55:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T361627)', diff saved to https://phabricator.wikimedia.org/P61644 and previous config saved to /var/cache/conftool/dbconfig/20240501-185521-marostegui.json [18:55:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:55:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:55:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:58:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:25] FIRING: [2x] SystemdUnitFailed: postfix@-.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:18] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [19:10:25] FIRING: [4x] SystemdUnitFailed: 
postfix@-.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:23] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [19:12:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:19:06] jouncebot: now [19:19:06] For the next 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T1800) [19:19:20] jouncebot: next [19:19:20] In 0 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T2000) [19:19:34] (03PS1) 10JHathaway: mx-out: use puppet 7, again [puppet] - 10https://gerrit.wikimedia.org/r/1026177 (https://phabricator.wikimedia.org/T325398) [19:20:06] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:20:12] (03CR) 10JHathaway: [C:03+2] mx-out: use puppet 7, again [puppet] - 10https://gerrit.wikimedia.org/r/1026177 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [19:25:25] FIRING: [4x] SystemdUnitFailed: postfix@-.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:06] PROBLEM - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:30:11] (03PS1) 10JHathaway: mx-out: use the puppet 7 
acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/1026178 (https://phabricator.wikimedia.org/T325398) [19:30:25] FIRING: [4x] SystemdUnitFailed: postfix@-.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:31:06] (03CR) 10JHathaway: [C:03+2] mx-out: use the puppet 7 acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/1026178 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [19:32:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:36:04] RECOVERY - Recursive DNS on 195.200.68.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:36:04] RECOVERY - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:39:36] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [19:40:41] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [19:40:41] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns7002.wikimedia.org with OS bookworm [19:40:50] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9762008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm com... 
[19:42:23] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9762017 (10ssingh) [19:43:53] FIRING: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:45:27] RESOLVED: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:37] (03PS1) 10Herron: alertmanager: irc: tweak whitespace for single alerts [puppet] - 10https://gerrit.wikimedia.org/r/1026180 (https://phabricator.wikimedia.org/T362239) [19:49:12] (03CR) 10Herron: [C:03+2] alertmanager: irc: tweak whitespace for single alerts [puppet] - 10https://gerrit.wikimedia.org/r/1026180 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T2000) [20:00:05] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:45] Guess it's just me today, I can self-deploy [20:02:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [20:02:47] (03Merged) 10jenkins-bot: [Vector] Enable appearance menu and increased font-size on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025878 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdlrobson) [20:03:17] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1025878|[Vector] Enable appearance menu and increased font-size on testwiki (T362147)]] [20:03:20] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:03:44] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 12), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9762052 (10Scott_French) Thanks, @SGupta-WMF - I'll keep an eye T362697 as wel... 
[20:08:17] !log jdrewniak@deploy1002 jdlrobson and jdrewniak: Backport for [[gerrit:1025878|[Vector] Enable appearance menu and increased font-size on testwiki (T362147)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:22] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:08:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:10:36] !log jdrewniak@deploy1002 jdlrobson and jdrewniak: Continuing with sync [20:16:01] (03PS1) 10JHathaway: mx-out: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) [20:16:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [20:16:20] (03CR) 10CI reject: [V:04-1] mx-out: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [20:18:58] (03PS2) 10JHathaway: mx-out: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) [20:19:53] (03PS3) 10JHathaway: mx-out: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) [20:20:14] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [20:22:47] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1025878|[Vector] Enable appearance menu and increased font-size on testwiki (T362147)]] (duration: 19m 29s) [20:22:51] T362147: Deploy reading accessibility settings menu and new typography defaults to 
first set of wikis - https://phabricator.wikimedia.org/T362147 [20:23:21] (03CR) 10JHathaway: [C:03+2] mx-out: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1026182 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [20:24:59] Hey everyone, scap finished but it says it produced an error... [20:25:05] https://www.irccloud.com/pastebin/Tos7haU6/ [20:25:53] but it looks like the config change was successful (as far as I can tell) [20:27:48] FIRING: PuppetFailure: Puppet has failed on mx-out1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:28:10] jhathaway: fyi ^ [20:28:22] rzl: thanks [20:32:48] FIRING: [2x] PuppetFailure: Puppet has failed on mx-out1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:44:30] (03PS2) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [20:44:50] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [20:49:33] (03PS3) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [20:50:02] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [20:57:16] (03PS4) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 
10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [20:57:44] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240501T2100) [21:00:49] (03PS5) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [21:01:19] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [21:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:07:49] FIRING: [2x] PuppetFailure: Puppet has failed on mx-out1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:12:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on mx-out1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:23:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:36:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf 
was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:36:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:42:14] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:13:03] 06SRE, 06SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9762323 (10Scott_French) [22:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:22:47] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 13), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9762392 (10VirginiaPoundstone) [22:27:26] 06SRE, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9762416 (10Dzahn) A list of errors that show up with a fresh deployment server on bullseye so far: error: E: Unable to locate package python-redis fix: [[ https://gerri... [22:36:39] 06SRE, 06serviceops: deployment_server bullseye - mw-cgroup.service: Failed - https://phabricator.wikimedia.org/T363957 (10Dzahn) 03NEW [22:37:42] (03CR) 10Dzahn: "I found this when trying to run a deployment_server on bullseye in production and I got an error about the mw-cgroup service, reported as " [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) (owner: 10Muehlenhoff) [22:39:06] 06SRE, 06serviceops: deployment_server bullseye - mw-cgroup.service: Failed - https://phabricator.wikimedia.org/T363957#9762508 (10Dzahn) I now found T325228#9445729 which seems like the exact same issue on snapshot hosts. 
[22:42:31] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.04.15 - 2024.05.05): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9762512 (Dzahn) >>! In T325228#9445729, @MoritzMuehlenhoff wrote: > .. The setup of the mw-cgroup (configured via...
[22:49:46] SRE, serviceops: deployment_server bullseye - mw-cgroup.service: Failed - https://phabricator.wikimedia.org/T363957#9762525 (Dzahn) I rebooted the VM and the issue went away! The grub config from the change above was applied apparently: ` root@deploy-1006:/# grep -Eo systemd.unified_cgroup_hierarchy=0...
[22:51:13] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.04.15 - 2024.05.05): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9762527 (Dzahn) update: rebooting the VM fixed the problem because then the grub config was applied: T363957#97625...
[22:51:44] SRE, serviceops: deployment_server bullseye - mw-cgroup.service: Failed - https://phabricator.wikimedia.org/T363957#9762528 (Dzahn) Open→Resolved a:Dzahn
[22:58:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:19:24] (PS1) Dzahn: geoip - test commit [puppet] - https://gerrit.wikimedia.org/r/1026192
[23:20:51] (CR) Dzahn: [C:-2] geoip - test commit [puppet] - https://gerrit.wikimedia.org/r/1026192 (owner: Dzahn)
[23:30:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025911
[23:38:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025911 (owner: TrainBranchBot)
[23:44:27] (PS1) Dzahn: mediawiki/geoip: make it optional to load geoip data from puppetserver [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415)
[23:44:47] (CR) CI reject: [V:-1] mediawiki/geoip: make it optional to load geoip data from puppetserver [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[23:44:58] (Abandoned) Dzahn: geoip - test commit [puppet] - https://gerrit.wikimedia.org/r/1026192 (owner: Dzahn)
[23:47:31] (PS2) Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415)
[23:47:50] (CR) CI reject: [V:-1] mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[23:51:44] SRE, serviceops, Patch-For-Review: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9762603 (Dzahn)
[23:58:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025911 (owner: TrainBranchBot)
[23:59:31] (PS1) Eevans: New group for users of Cassandra staging (cassandra-dev) [puppet] - https://gerrit.wikimedia.org/r/1026194 (https://phabricator.wikimedia.org/T355730)