[00:02:05] !incidents
[00:02:06] 5578 (UNACKED) [5x] ProbeDown sre (probes/service eqsin)
[00:02:09] !ack 5578
[00:02:09] 5578 (ACKED) [5x] ProbeDown sre (probes/service eqsin)
[00:03:10] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin for service: ncredir-addrs [reason: no reason specified, no task ID specified]
[00:03:23] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site eqsin for service: ncredir-addrs [reason: no reason specified, no task ID specified]
[00:03:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:03:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:04:19] FIRING: [4x] ProbeDown: Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:04:27] yeah don't think it's a good idea since it's definitely more than ncredir-addrs here
[00:04:58] FIRING: [9x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:05:24] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp5018.eqsin.wmnet, cp5023.eqsin.wmnet
are marked down but pooled: testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet are marked down but pooled: testlb6_80: Servers cp5024.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: ncredirlb6_80: Servers ncredir5002.eqsin.wmnet are marked down but poole
[00:05:24] irlb_443: Servers ncredir5002.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet
[00:05:25] .eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5 https://wikitech.wikimedia.org/wiki/PyBal
[00:06:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[00:06:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from JP) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[00:07:14] FIRING: [6x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:07:16] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:08:42] RESOLVED:
JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:08:48] !incidents
[00:08:48] 5578 (ACKED) [5x] ProbeDown sre (probes/service eqsin)
[00:08:49] 5579 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out)
[00:08:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:08:55] !ack 5579
[00:08:56] 5579 (ACKED) NELHigh sre (thanos-rule tcp.timed_out)
[00:09:19] RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:58] RESOLVED: [7x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:11:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[00:11:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from JP) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[00:11:37] o/
[00:24:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426065 (phaultfinder)
[00:30:09] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107568 (owner: TrainBranchBot)
[00:38:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107757
[00:38:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107757 (owner: TrainBranchBot)
[00:55:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107757 (owner: TrainBranchBot)
[01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107569 (owner: TrainBranchBot)
[01:07:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107758
[01:07:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107758 (owner: TrainBranchBot)
[01:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426087 (phaultfinder)
[01:25:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107758 (owner: TrainBranchBot)
[01:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:39:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL -
/srv/docker/overlay2/c1efcce25c560cee9dd502b4c05ba3d5ab33d81219080fe4bf8e79269b917336/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:59:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:29:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426094 (phaultfinder)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:29:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426102 (phaultfinder)
[04:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426111 (phaultfinder)
[04:50:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:55:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:00:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:02:16] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:18] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:09:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426122 (phaultfinder)
[05:18:48] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1106302 (owner: L10n-bot)
[05:20:55] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1107327 (owner: L10n-bot)
[05:24:25] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net.
[software/mailman-templates] - https://gerrit.wikimedia.org/r/1106920 (owner: L10n-bot)
[05:24:34] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1106305 (owner: L10n-bot)
[05:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426126 (phaultfinder)
[05:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:31:25] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1107330 (owner: L10n-bot)
[05:32:02] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1106917 (owner: L10n-bot)
[05:35:29] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net.
[software/mailman-templates] - https://gerrit.wikimedia.org/r/1107330 (owner: L10n-bot)
[05:41:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:51:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:59:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:04:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:04:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426127 (phaultfinder)
[06:29:30] (PS1) Marostegui: db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1107882 (https://phabricator.wikimedia.org/T382744)
[06:29:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426130 (phaultfinder)
[06:29:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2123.codfw.wmnet with reason: maintenance
[06:29:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2123.codfw.wmnet with reason: maintenance
[06:30:01] (CR) Marostegui: [C:+2] db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1107882 (https://phabricator.wikimedia.org/T382744) (owner: Marostegui)
[06:32:19] !log marostegui@cumin1002 dbctl
commit (dc=all): 'Depool db2171 to clone db2123 T382744', diff saved to https://phabricator.wikimedia.org/P71750 and previous config saved to /var/cache/conftool/dbconfig/20250102-063218-marostegui.json
[06:32:21] T382744: mysql crash on db2123 - https://phabricator.wikimedia.org/T382744
[06:32:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: maintenance
[06:32:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: maintenance
[06:37:07] !log root@cumin1002 START - Cookbook sre.mysql.clone of db2171.codfw.wmnet onto db2123.codfw.wmnet
[06:40:05] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10426139 (Marostegui) p:Triage→Medium
[06:42:54] (PS1) Marostegui: dbproxy1028: Update hosts [puppet] - https://gerrit.wikimedia.org/r/1107883 (https://phabricator.wikimedia.org/T368874)
[06:43:35] (PS2) Marostegui: dbproxy1028: Update hosts [puppet] - https://gerrit.wikimedia.org/r/1107883 (https://phabricator.wikimedia.org/T368874)
[06:44:19] (CR) Marostegui: [C:+2] dbproxy1028: Update hosts [puppet] - https://gerrit.wikimedia.org/r/1107883 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[06:50:34] (PS1) Marostegui: production-m3.sql.erb: Replace dbproxy1020 with dbproxy1028 [puppet] - https://gerrit.wikimedia.org/r/1107884 (https://phabricator.wikimedia.org/T368874)
[06:51:14] (CR) Marostegui: "This is a NOOP and grants were applied on the database live already. dbproxy1028 is showing hosts as UP now."
[puppet] - https://gerrit.wikimedia.org/r/1107884 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[06:53:05] (CR) Marostegui: [C:+2] production-m3.sql.erb: Replace dbproxy1020 with dbproxy1028 [puppet] - https://gerrit.wikimedia.org/r/1107884 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:13:00] (PS1) Marostegui: production-m5.sql.erb: Replace dbproxy1021 with dbproxy1029 [puppet] - https://gerrit.wikimedia.org/r/1107887 (https://phabricator.wikimedia.org/T368874)
[07:13:32] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2171.codfw.wmnet onto db2123.codfw.wmnet
[07:14:17] (PS1) Marostegui: report_users.sh: Add dbproxy102[89] [software] - https://gerrit.wikimedia.org/r/1107888 (https://phabricator.wikimedia.org/T368874)
[07:16:00] (CR) Marostegui: [C:+2] report_users.sh: Add dbproxy102[89] [software] - https://gerrit.wikimedia.org/r/1107888 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:16:28] (Merged) jenkins-bot: report_users.sh: Add dbproxy102[89] [software] - https://gerrit.wikimedia.org/r/1107888 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71751 and previous config saved to /var/cache/conftool/dbconfig/20250102-071700-root.json
[07:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71752 and previous config saved to
/var/cache/conftool/dbconfig/20250102-073206-root.json
[07:32:52] ops-codfw, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867 (Marostegui) NEW
[07:32:54] ops-codfw, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426181 (Marostegui) a:Marostegui
[07:33:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbproxy2001.codfw.wmnet with reason: maintenance
[07:33:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbproxy2001.codfw.wmnet with reason: maintenance
[07:33:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbproxy2002.codfw.wmnet with reason: maintenance
[07:33:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbproxy2002.codfw.wmnet with reason: maintenance
[07:33:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbproxy2003.codfw.wmnet with reason: maintenance
[07:33:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbproxy2003.codfw.wmnet with reason: maintenance
[07:34:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbproxy2004.codfw.wmnet with reason: maintenance
[07:34:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbproxy2004.codfw.wmnet with reason: maintenance
[07:35:11] !log Stop haproxy on dbproxy200[14] T381962
[07:35:13] ops-codfw, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426184 (Marostegui)
[07:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:14] T381962: Decommission dbproxy200[1-4] -
https://phabricator.wikimedia.org/T381962
[07:36:28] ops-codfw, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426187 (Marostegui)
[07:38:48] (PS1) Marostegui: production-m1.sql.erb: Replace dbproxy2001 with dbproxy2005 [puppet] - https://gerrit.wikimedia.org/r/1107917
[07:40:14] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1107917 (owner: Marostegui)
[07:40:28] (CR) Marostegui: [C:+2] production-m5.sql.erb: Replace dbproxy1021 with dbproxy1029 [puppet] - https://gerrit.wikimedia.org/r/1107887 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:41:00] (CR) Marostegui: [C:+2] production-m1.sql.erb: Replace dbproxy2001 with dbproxy2005 [puppet] - https://gerrit.wikimedia.org/r/1107917 (owner: Marostegui)
[07:47:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71753 and previous config saved to /var/cache/conftool/dbconfig/20250102-074711-root.json
[07:57:36] (PS1) Marostegui: mariadb: Decommission dbproxy2001 [puppet] - https://gerrit.wikimedia.org/r/1107919 (https://phabricator.wikimedia.org/T382867)
[07:58:27] (PS1) Marostegui: report_users.sh: Remove dbproxy2001 [software] - https://gerrit.wikimedia.org/r/1107920 (https://phabricator.wikimedia.org/T382867)
[07:59:02] ops-codfw, DC-Ops, decommission-hardware, Patch-For-Review: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426197 (Marostegui)
[07:59:15] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware, Patch-For-Review: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426198 (Marostegui)
[07:59:22] (CR) Marostegui: [C:+2] report_users.sh: Remove dbproxy2001 [software] - https://gerrit.wikimedia.org/r/1107920
(https://phabricator.wikimedia.org/T382867) (owner: Marostegui)
[07:59:48] (Merged) jenkins-bot: report_users.sh: Remove dbproxy2001 [software] - https://gerrit.wikimedia.org/r/1107920 (https://phabricator.wikimedia.org/T382867) (owner: Marostegui)
[08:00:06] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T0800). nyaa~
[08:00:06] hubaishan and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy2001.codfw.wmnet
[08:02:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71754 and previous config saved to /var/cache/conftool/dbconfig/20250102-080216-root.json
[08:04:34] (CR) Marostegui: [C:+2] mariadb: Decommission dbproxy2001 [puppet] - https://gerrit.wikimedia.org/r/1107919 (https://phabricator.wikimedia.org/T382867) (owner: Marostegui)
[08:04:49] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:05:52] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware, Patch-For-Review: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426201 (Marostegui)
[08:08:20] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:08:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2001.codfw.wmnet decommissioned, removing all IPs
except the asset tag one - marostegui@cumin1002"
[08:08:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:08:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy2001.codfw.wmnet
[08:08:50] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware, Patch-For-Review: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426202 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `dbproxy2001.codfw.wmnet` -...
[08:09:11] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426203 (Marostegui)
[08:09:21] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426206 (Marostegui) This is ready for #dc-ops
[08:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71755 and previous config saved to /var/cache/conftool/dbconfig/20250102-081722-root.json
[08:26:54] (PS1) Marostegui: db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1107921
[08:27:28] (CR) Marostegui: [C:+2] db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1107921 (owner: Marostegui)
[08:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 1%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71756 and previous config saved to /var/cache/conftool/dbconfig/20250102-082806-root.json
[08:29:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426214 (phaultfinder)
[08:43:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: Repooling after recloning', diff
saved to https://phabricator.wikimedia.org/P71757 and previous config saved to /var/cache/conftool/dbconfig/20250102-084312-root.json
[08:46:38] (PS1) Marostegui: production-m2.sql.erb: Replace dbproxy2002 with dbproxy2006 [puppet] - https://gerrit.wikimedia.org/r/1107922 (https://phabricator.wikimedia.org/T381962)
[08:46:57] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1107922 (https://phabricator.wikimedia.org/T381962) (owner: Marostegui)
[08:48:42] (CR) Marostegui: [C:+2] production-m2.sql.erb: Replace dbproxy2002 with dbproxy2006 [puppet] - https://gerrit.wikimedia.org/r/1107922 (https://phabricator.wikimedia.org/T381962) (owner: Marostegui)
[08:50:17] (PS1) Marostegui: report_users.sh: Remove dbproxy2002 [software] - https://gerrit.wikimedia.org/r/1107923 (https://phabricator.wikimedia.org/T381962)
[08:52:06] (PS1) Marostegui: mariadb: Remove dbproxy2002 [puppet] - https://gerrit.wikimedia.org/r/1107924 (https://phabricator.wikimedia.org/T382868)
[08:52:07] (CR) Marostegui: [C:+2] report_users.sh: Remove dbproxy2002 [software] - https://gerrit.wikimedia.org/r/1107923 (https://phabricator.wikimedia.org/T381962) (owner: Marostegui)
[08:52:41] (Merged) jenkins-bot: report_users.sh: Remove dbproxy2002 [software] - https://gerrit.wikimedia.org/r/1107923 (https://phabricator.wikimedia.org/T381962) (owner: Marostegui)
[08:53:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy2002.codfw.wmnet
[08:57:38] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:58:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71758 and previous config saved to /var/cache/conftool/dbconfig/20250102-085816-root.json
[08:58:38] is there a chance to deploy scheduled patches?
[09:00:08] hubaishan: The deployers may be on vacation...
[09:01:17] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:01:30] (PS2) Anzx: bjnwikiquote: add wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777)
[09:01:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:01:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:01:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy2002.codfw.wmnet
[09:01:36] (CR) Marostegui: [C:+2] mariadb: Remove dbproxy2002 [puppet] - https://gerrit.wikimedia.org/r/1107924 (https://phabricator.wikimedia.org/T382868) (owner: Marostegui)
[09:01:38] ops-codfw, SRE, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426243 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `dbproxy2002.codfw.wmnet` - dbproxy2002....
[09:01:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: Anzx)
[09:02:13] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2002.codfw.wmnet - https://phabricator.wikimedia.org/T382868#10426257 (Marostegui) a:Marostegui→None
[09:02:21] ops-codfw, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2002.codfw.wmnet - https://phabricator.wikimedia.org/T382868#10426262 (Marostegui) This is ready for #dc-ops
[09:09:19] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:13:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71759 and previous config saved to /var/cache/conftool/dbconfig/20250102-091322-root.json
[09:14:51] RECOVERY - ganeti-wconfd running on ganeti4008 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:17:52] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426265 (phaultfinder)
[09:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%:
Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71760 and previous config saved to /var/cache/conftool/dbconfig/20250102-092827-root.json
[09:31:20] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: PuppetConstantChange (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T382870 (LSobanski) NEW
[09:32:14] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T382871 (LSobanski) NEW
[09:43:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71761 and previous config saved to /var/cache/conftool/dbconfig/20250102-094332-root.json
[09:47:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[09:47:43] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[09:48:45] SRE, Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873 (cmooney) NEW p:Triage→Medium
[09:50:53] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874 (MatthewVernon) NEW
[09:51:23] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10426320 (MatthewVernon) p:Triage→High
[09:56:06] !log restart swift-object on
ms-be1075 [09:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71762 and previous config saved to /var/cache/conftool/dbconfig/20250102-095838-root.json [09:58:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10426336 (10Marostegui) a:05Marostegui→03None [10:01:33] (03PS1) 10Marostegui: report_users.sh: Remove dbproxy2003 [software] - 10https://gerrit.wikimedia.org/r/1107925 (https://phabricator.wikimedia.org/T381962) [10:02:11] 06SRE, 06Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873#10426340 (10cmooney) Checking with curl from ganeti4008 to ganeti4006 on TCP 1811 this is reported: ` * Server certificate: * subject: CN=ganeti.example.com * start date: Dec 17 19:12:2... 
[10:02:24] (03PS1) 10Marostegui: production-m3.sql.erb: Replace dbproxy2003 with dbproxy2007 [puppet] - 10https://gerrit.wikimedia.org/r/1107926 (https://phabricator.wikimedia.org/T381962) [10:02:35] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1107926 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:02:49] (03CR) 10Marostegui: [C:03+2] report_users.sh: Remove dbproxy2003 [software] - 10https://gerrit.wikimedia.org/r/1107925 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:03:16] (03Merged) 10jenkins-bot: report_users.sh: Remove dbproxy2003 [software] - 10https://gerrit.wikimedia.org/r/1107925 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:05:05] (03CR) 10Marostegui: [C:03+2] production-m3.sql.erb: Replace dbproxy2003 with dbproxy2007 [puppet] - 10https://gerrit.wikimedia.org/r/1107926 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:06:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy2003.codfw.wmnet [10:07:50] (03PS1) 10Marostegui: mariadb: Decommission dbproxy2003 [puppet] - 10https://gerrit.wikimedia.org/r/1107927 (https://phabricator.wikimedia.org/T382875) [10:10:34] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [10:11:17] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, 10Move-Files-To-Commons: Error using FileImporter and undelete file on Commons because of "local-multiwrite/local-public...is in an inconsistent state within the inte... 
- https://phabricator.wikimedia.org/T382715#10426350 [10:14:19] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:14:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:14:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:14:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy2003.codfw.wmnet [10:15:38] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, 10Move-Files-To-Commons: Error using FileImporter and undelete file on Commons because of "local-multiwrite/local-public...is in an inconsistent state within the inte... 
- https://phabricator.wikimedia.org/T382715#10426353 [10:15:45] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission dbproxy2003 [puppet] - 10https://gerrit.wikimedia.org/r/1107927 (https://phabricator.wikimedia.org/T382875) (owner: 10Marostegui) [10:16:21] 10ops-codfw, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2003.codfw.wmnet - https://phabricator.wikimedia.org/T382875#10426355 (10Marostegui) a:05Marostegui→03None [10:16:28] 10ops-codfw, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2003.codfw.wmnet - https://phabricator.wikimedia.org/T382875#10426360 (10Marostegui) Ready for #dc-ops [10:29:20] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426367 (10phaultfinder) [10:33:59] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Some files uploaded on 2024-12-23 not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T382765#10426371 (10MatthewVernon) A bunch of different things here, taking them in order: # looks OK now # file page looks OK (and the fil... [10:36:16] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Media storage error with the re-uploading file in Commons - https://phabricator.wikimedia.org/T382764#10426373 (10MatthewVernon) I'm sorry, but I'm confused by this report. There appear to have been two versions of this image uploaded, and as far a... [10:40:43] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Media storage error with the re-uploading file in Commons - https://phabricator.wikimedia.org/T382764#10426399 (10Aklapper) 05Open→03Resolved @Kaganer: For future reference, please fill in the sections of the bug report form template to avo... 
[10:41:31] (03PS1) 10Marostegui: report_users.sh: Remove dbproxy2004 [software] - 10https://gerrit.wikimedia.org/r/1107931 (https://phabricator.wikimedia.org/T381962) [10:41:32] (03PS1) 10Marostegui: production-m5.sql.erb: Replace dbproxy2004 with dbproxy2008 [puppet] - 10https://gerrit.wikimedia.org/r/1107930 (https://phabricator.wikimedia.org/T381962) [10:41:41] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1107930 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:42:12] (03CR) 10Marostegui: [C:03+2] report_users.sh: Remove dbproxy2004 [software] - 10https://gerrit.wikimedia.org/r/1107931 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:42:52] (03Merged) 10jenkins-bot: report_users.sh: Remove dbproxy2004 [software] - 10https://gerrit.wikimedia.org/r/1107931 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:46:21] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: deploy instances from a single configuration [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [10:47:02] (03CR) 10Marostegui: [C:03+2] production-m5.sql.erb: Replace dbproxy2004 with dbproxy2008 [puppet] - 10https://gerrit.wikimedia.org/r/1107930 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [10:48:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy2004.codfw.wmnet [10:49:24] (03PS1) 10Marostegui: mariadb: Decommission dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/1107932 (https://phabricator.wikimedia.org/T382877) [10:51:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: 10Dreamrimmer) [10:52:09] (03CR) 10ScheduleDeploymentBot: "Scheduled 
for deployment in the [Thursday, January 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [10:52:53] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [10:56:11] !log dbmaint codfw Decommissioned dbproxy200[1-4] T381962 [10:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:14] T381962: Decommission dbproxy200[1-4] - https://phabricator.wikimedia.org/T381962 [10:56:25] !log dbmaint codfw Decommissioned dbproxy200[1-4] m1 m2 m3 m5 T381962 [10:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:27] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:56:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:56:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy2004.codfw.wmnet [10:56:56] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/1107932 (https://phabricator.wikimedia.org/T382877) (owner: 10Marostegui) [10:57:58] 10ops-codfw, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2004.codfw.wmnet - https://phabricator.wikimedia.org/T382877#10426436 (10Marostegui) a:05Marostegui→03None [10:58:05] 10ops-codfw, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission 
dbproxy2004.codfw.wmnet - https://phabricator.wikimedia.org/T382877#10426441 (10Marostegui) This is ready for #dc-ops [10:58:58] PROBLEM - HTTPS Ganeti RAPI eqsin on ganeti5004 is CRITICAL: connect to address ganeti01.svc.eqsin.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [10:59:08] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:59:08] PROBLEM - ganeti-noded running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:59:10] PROBLEM - ganeti-confd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1100) [11:00:08] RECOVERY - ganeti-noded running on ganeti5004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:00:09] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T381993 [11:00:10] RECOVERY - ganeti-confd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:00:12] T381993: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T381993 [11:00:23] ^ that's me, I'm trying to extend the ganeti cert for eqsin, but it [11:00:28] ^ that's me, I'm trying to extend the ganeti cert for eqsin, but it's running into an error [11:00:40] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T381993 [11:01:02] RECOVERY - HTTPS 
Ganeti RAPI eqsin on ganeti5004 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:01:08] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:02:15] (03PS9) 10Tiziano Fogli: ripeatlas: remove hardcoded measurements [alerts] - 10https://gerrit.wikimedia.org/r/1105747 (https://phabricator.wikimedia.org/T370506) [11:02:16] (03CR) 10Tiziano Fogli: "This is ready for review." [alerts] - 10https://gerrit.wikimedia.org/r/1105747 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [11:02:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1209 with weight 0 T381993', diff saved to https://phabricator.wikimedia.org/P71763 and previous config saved to /var/cache/conftool/dbconfig/20250102-110232-root.json [11:03:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1209 from API/vslow/dump T381993', diff saved to https://phabricator.wikimedia.org/P71764 and previous config saved to /var/cache/conftool/dbconfig/20250102-110305-root.json [11:08:22] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1102333 (https://phabricator.wikimedia.org/T381993) [11:08:43] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1102333 (https://phabricator.wikimedia.org/T381993) (owner: 10Gerrit maintenance bot) [11:09:03] !log Starting s8 eqiad failover from db1193 to db1209 - T381993 [11:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:06] T381993: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T381993 [11:09:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1209 to s8 primary T381993', diff saved to https://phabricator.wikimedia.org/P71765 
and previous config saved to /var/cache/conftool/dbconfig/20250102-110923-marostegui.json [11:11:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1193 T381993', diff saved to https://phabricator.wikimedia.org/P71766 and previous config saved to /var/cache/conftool/dbconfig/20250102-111105-marostegui.json [11:14:04] (03CR) 10Hashar: [C:04-1] "The fault comes from 22c4cd340c881165345fbc2740ae55e5bdf33fac _Fix sudo docker-pkg_. When I revert that commit in my local copy of dev-ima" [puppet] - 10https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: 10Brennen Bearnes) [11:15:45] (03CR) 10Hashar: [C:04-1] "> The fault comes from 22c4cd340c881165345fbc2740ae55e5bdf33fac _Fix sudo docker-pkg_. When I revert that commit in my local copy of dev-i" [puppet] - 10https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: 10Brennen Bearnes) [11:22:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance [11:22:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance [11:22:56] (03PS1) 10Marostegui: db1193: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1107933 [11:24:38] (03CR) 10Marostegui: [C:03+2] db1193: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1107933 (owner: 10Marostegui) [11:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:58] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum three container dbs [11:29:11] !log mvernon@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum three container dbs [11:29:18] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10426495 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cfd9f3f7-948f-45c9-9f34-d03cdafe30cc) set by mvernon@cumin... [11:31:33] !log dbmaint s8 db1193 eqiad rebuild pagelinks and recentchanges and deploy schema change on revision table T367856 T382842 [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:31:37] T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842 [11:48:50] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10426558 (10MatthewVernon) Ms-be1066 was more problematic - it had no 7G containers, but it did have 20 4G ones; I vacuumed the three b... 
[11:49:06] !log mvernon@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1066.eqiad.wmnet [11:49:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1066.eqiad.wmnet [11:52:27] (03PS1) 10Muehlenhoff: Manage Ganeti known_hosts in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1107935 [11:58:22] (03PS2) 10Muehlenhoff: Manage Ganeti known_hosts in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1107935 [11:59:03] (03PS1) 10Novem Linguae: enable 2 factor authentication for enwiki page movers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) [11:59:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1107935 (owner: 10Muehlenhoff) [12:05:36] (03CR) 10Muehlenhoff: [C:03+2] Manage Ganeti known_hosts in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1107935 (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1300) [13:03:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:04:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:19:43] PROBLEM - HTTPS Ganeti RAPI eqsin on ganeti5004 is CRITICAL: connect to address ganeti01.svc.eqsin.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [13:20:43] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:24:43] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd 
https://wikitech.wikimedia.org/wiki/Ganeti [13:25:43] RECOVERY - HTTPS Ganeti RAPI eqsin on ganeti5004 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.016 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [13:27:43] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:31:45] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:33:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:45] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:43:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:45] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:46:45] PROBLEM - HTTPS Ganeti RAPI eqsin on ganeti5004 is CRITICAL: connect to address ganeti01.svc.eqsin.wmnet and port 5080: 
Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [13:46:49] (03PS1) 10Muehlenhoff: Revert "Manage Ganeti known_hosts in eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1107948 [13:47:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:47:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:47:45] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:47:45] PROBLEM - ganeti-noded running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:47:45] PROBLEM - ganeti-confd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:48:31] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:19] FIRING: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:45] 
RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:49:45] RECOVERY - ganeti-noded running on ganeti5004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:49:45] RECOVERY - ganeti-confd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:50:45] RECOVERY - HTTPS Ganeti RAPI eqsin on ganeti5004 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [13:51:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:14] RESOLVED: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:45] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:56:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1400). Please do the needful. [14:00:05] anzx, hubaishan, and DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:34] * anzx o/ [14:01:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:45] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:03:07] \o/ [14:03:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:29] o/ [14:04:45] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:06:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:12] is there any 
deployers? [14:13:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:10] no one? [14:17:45] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:18:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:47] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426753 (10phaultfinder) [14:26:26] (03PS1) 10Marostegui: instances.yaml: Remove db2116 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1107949 (https://phabricator.wikimedia.org/T362950) [14:26:58] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2116 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1107949 (https://phabricator.wikimedia.org/T362950) (owner: 10Marostegui) [14:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2116 from dbctl T362950', diff saved to https://phabricator.wikimedia.org/P71768 and previous config saved to /var/cache/conftool/dbconfig/20250102-142806-marostegui.json [14:28:11] T362950: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950 [14:28:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:01] (03PS1) 10Marostegui: db2116: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1107950 (https://phabricator.wikimedia.org/T362950) [14:29:29] (03CR) 10Marostegui: [C:03+2] db2116: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1107950 (https://phabricator.wikimedia.org/T362950) (owner: 10Marostegui) [14:29:47] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:30:21] (03CR) 10Ssingh: [C:03+1] ownership: Traffic cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [14:33:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:47] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:34:47] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:36] (03PS3) 10FNegri: Allow pty allocation for cumin ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) [14:39:13] (03CR) 10FNegri: Allow pty 
allocation for cumin ssh keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [14:42:47] PROBLEM - HTTPS Ganeti RAPI eqsin on ganeti5004 is CRITICAL: connect to address ganeti01.svc.eqsin.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [14:42:47] PROBLEM - ganeti-confd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:42:47] PROBLEM - ganeti-noded running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:43:47] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:44:19] FIRING: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:47] RECOVERY - ganeti-confd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:45:47] RECOVERY - ganeti-noded running on ganeti5004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:46:47] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:47:14] RESOLVED: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:47] RECOVERY - HTTPS Ganeti RAPI eqsin on ganeti5004 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [14:49:43] (CR) Muehlenhoff: [C:+2] Revert "Manage Ganeti known_hosts in eqsin" [puppet] - https://gerrit.wikimedia.org/r/1107948 (owner: Muehlenhoff) [14:49:47] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:52:02] (PS1) AOkoth: kubectl: image with kubectl installed [docker-images/production-images] - https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) [14:54:44] (CR) Ssingh: "/etc/powerdns/extrarecursorhosts is created on the DNS hosts as well, see inline comments."
[puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott) [14:58:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:40] (PS4) Ladsgroup: mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - https://gerrit.wikimedia.org/r/1103353 [15:14:47] (CR) Ladsgroup: [V:+2 C:+2] mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - https://gerrit.wikimedia.org/r/1103353 (owner: Ladsgroup) [15:17:55] RECOVERY - ganeti-wconfd running on ganeti5004 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:18:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:22] (CR) Ssingh: [V:+1] trafficserver: explicitly specify user/group for systemd unit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091330 (owner: Ssingh) [15:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:27] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:29] (CR) Alexandros Kosiaris: [C:-1] "No need for downloading from k8s.io (and it wouldn't work anyway), we have the kubernetes-client package in our apt repos and we are guara" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth) [15:36:26] (CR) Ssingh: [C:+1] "Looks good!
(Nit: perhaps we should add a quick link to the "bug fixed" here for posterity.)" [puppet] - https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [15:36:56] (CR) Ladsgroup: [V:+2 C:+2] "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1103353 (owner: Ladsgroup) [15:37:20] (PS1) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1107955 (https://phabricator.wikimedia.org/T382900) [15:39:52] (CR) Ssingh: "[might be helpful to run PCC on cp1100 and some other host as well.]" [puppet] - https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata) [15:40:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10426999 (phaultfinder) [15:41:44] (Abandoned) Ladsgroup: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1107955 (https://phabricator.wikimedia.org/T382900) (owner: Gerrit maintenance bot) [15:42:15] (Abandoned) Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1070873 (https://phabricator.wikimedia.org/T374088) (owner: Gerrit maintenance bot) [15:42:45] (Abandoned) Ladsgroup: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1040886 (https://phabricator.wikimedia.org/T367020) (owner: Gerrit maintenance bot) [15:43:00] (Abandoned) Ladsgroup: Add zh to langlist helper [dns] - https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: Gerrit maintenance bot) [15:49:43] (PS2) AOkoth: kubectl: image with kubectl installed [docker-images/production-images] - https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) [15:52:13] (CR) AOkoth: "Acknowledged."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth) [15:55:45] SRE, Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873#10427027 (MoritzMuehlenhoff) It seems we lost monitoring for the internal Ganeti cert expiry in the whole Icinga->AM work? This was definitely alerted for in the past. I'll open a separ... [15:58:10] SRE-swift-storage, Commons, MediaWiki-File-management: Some files uploaded on 2024-12-23 not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T382765#10427035 (Pppery) 2 and 4 were deleted on Commons and then recreated as redirects (so there's definitely no Swift issue there anymore)... [15:59:00] SRE, Ganeti, Infrastructure-Foundations: (Re-)Add monitoring for the internal Ganeti certs - https://phabricator.wikimedia.org/T382902 (MoritzMuehlenhoff) NEW [15:59:09] SRE, Ganeti, Infrastructure-Foundations: (Re-)Add monitoring for the internal Ganeti certs - https://phabricator.wikimedia.org/T382902#10427053 (MoritzMuehlenhoff) p:Triage→High [16:00:05] Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1600) [16:00:15] no train no triage! [16:09:05] happy2025, hashar! just for form, I did pop into the train log triage meeting for a minute to say I'd been (and also to say Happy 2025 in case anyone else was there). :-) [16:09:26] SRE, Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873#10427066 (cmooney) [16:12:07] apergos: ahhh :) [16:12:22] apergos: happy new year!
:) [16:12:32] ;-) [16:12:59] SRE-swift-storage, Commons, MediaWiki-File-management: Some files uploaded on 2024-12-23 not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T382765#10427072 (MatthewVernon) Open→Resolved a:MatthewVernon Right, yes, so I'm going to close this ticket - I suggest the p... [16:22:11] (CR) Ssingh: [C:+1] "Looks good but I am curious to know why you are changing this. network-timeout here refers to the timeout waiting for an auth server to re" [puppet] - https://gerrit.wikimedia.org/r/1105944 (owner: Andrew Bogott) [16:23:49] (CR) Ssingh: "Let's rebase this and plan to merge this!" [dns] - https://gerrit.wikimedia.org/r/1097521 (owner: CDobbins) [16:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427088 (phaultfinder) [16:30:19] SRE, Ganeti, Infrastructure-Foundations: (Re-)Add monitoring for the internal Ganeti certs - https://phabricator.wikimedia.org/T382902#10427103 (cmooney) Unsure, but possibly we removed the functionality with this change? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/e0e19d4b509... [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS.
[17:19:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427307 (phaultfinder) [17:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427328 (phaultfinder) [17:29:18] (PS5) CDobbins: Update geo-maps file's US section [dns] - https://gerrit.wikimedia.org/r/1097521 [17:33:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:57] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:42] SRE, serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397#10427418 (jijiki) Open→Resolved a:jijiki [17:47:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:47:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [17:48:04] SRE, serviceops: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled -
https://phabricator.wikimedia.org/T291921#10427421 (jijiki) Open→Invalid [18:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T1800) [18:00:27] * bd808 looks to see if he has things to ship out [18:03:06] (PS1) BryanDavis: developer-portal: Bump container to 2024-12-26-121817-production [deployment-charts] - https://gerrit.wikimedia.org/r/1107962 [18:13:28] (CR) BryanDavis: [C:+2] developer-portal: Bump container to 2024-12-26-121817-production [deployment-charts] - https://gerrit.wikimedia.org/r/1107962 (owner: BryanDavis) [18:21:26] (Merged) jenkins-bot: developer-portal: Bump container to 2024-12-26-121817-production [deployment-charts] - https://gerrit.wikimedia.org/r/1107962 (owner: BryanDavis) [18:21:59] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:22:02] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:22:28] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:22:47] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:22:56] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:23:23] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:23:35] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:23:56] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:38:21] (PS1) Jdlrobson: Stop expanding sections by default on Wiktionary [mediawiki-config]
- https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) [18:44:40] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10427613 (Andrew) I worked on this a bit over the break. I'm pretty happy with the [[ https://wts.wmcloud.org | static site that httrack produces ]]. It was generated l... [18:50:12] (CR) Ssingh: "Left a note in the original task as well but we should move eqiad to the end of the list vs removing it completely, IMO. We can wait for m" [dns] - https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) (owner: CDobbins) [18:59:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427644 (phaultfinder) [19:00:27] (PS8) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041 [19:01:15] SRE-swift-storage, Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802#10427652 (Koavf) Is it possible to make T382859 a child to this? Thanks.
[19:01:19] (CR) CI reject: [V:-1] Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041 (owner: Jdlrobson) [19:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427689 (phaultfinder) [19:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:36] (CR) Ssingh: pdns recursor: support injecting extra hostnames into recursor config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott) [20:29:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427807 (phaultfinder) [20:54:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427815 (phaultfinder) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS.
[21:00:14] no patches indeed [21:01:37] (CR) Urbanecm: [C:+2] gomwiki: Use wikitext talk pages by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1107561 (https://phabricator.wikimedia.org/T382810) (owner: Urbanecm) [21:01:44] let's make use of the window [21:02:22] (Merged) jenkins-bot: gomwiki: Use wikitext talk pages by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1107561 (https://phabricator.wikimedia.org/T382810) (owner: Urbanecm) [21:03:03] (PS3) Urbanecm: [Growth] Remove Marketing campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) [21:03:06] (CR) Urbanecm: [C:+2] [Growth] Remove Marketing campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) (owner: Urbanecm) [21:03:45] (Merged) jenkins-bot: [Growth] Remove Marketing campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) (owner: Urbanecm) [21:03:50] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) (owner: Urbanecm) [21:04:42] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1106316|[Growth] Remove Marketing campaign (T382499)]], [[gerrit:1107561|gomwiki: Use wikitext talk pages by default (T382810)]] [21:04:46] T382499: Remove Marketing experiment related code from GrowthExperiments - https://phabricator.wikimedia.org/T382499 [21:04:46] T382810: Change the default content model for all discussion pages on gomwiki from Structured Discussions back to wikitext - https://phabricator.wikimedia.org/T382810 [21:09:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427826 (phaultfinder) [21:17:13] !log urbanecm@deploy2002 urbanecm: Backport for
[[gerrit:1106316|[Growth] Remove Marketing campaign (T382499)]], [[gerrit:1107561|gomwiki: Use wikitext talk pages by default (T382810)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:17] T382499: Remove Marketing experiment related code from GrowthExperiments - https://phabricator.wikimedia.org/T382499 [21:17:17] T382810: Change the default content model for all discussion pages on gomwiki from Structured Discussions back to wikitext - https://phabricator.wikimedia.org/T382810 [21:18:01] !log urbanecm@deploy2002 urbanecm: Continuing with sync [21:18:04] proceeding [21:23:26] SRE, Editing-team, MediaWiki-Debug-Logger, observability, and 5 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10427858 (Etonkovidova) Open→Resolved Checked [[ https://logstash.wikimedia.org/goto/763c285468ac71de5a759de3940f9157 | log... [21:26:25] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106316|[Growth] Remove Marketing campaign (T382499)]], [[gerrit:1107561|gomwiki: Use wikitext talk pages by default (T382810)]] (duration: 21m 42s) [21:26:29] T382499: Remove Marketing experiment related code from GrowthExperiments - https://phabricator.wikimedia.org/T382499 [21:26:29] T382810: Change the default content model for all discussion pages on gomwiki from Structured Discussions back to wikitext - https://phabricator.wikimedia.org/T382810 [21:30:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10427909 (phaultfinder) [21:42:15] (CR) Pppery: [C:+1] Fix links pointing to m:Help:Export [mediawiki-config] - https://gerrit.wikimedia.org/r/1106739 (owner: Tacsipacsi) [21:42:55] (CR) Pppery: [C:+1] "I missed this case during my cleanup for code links to Meta because I was only checking core and extensions, not config."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1106739 (owner: Tacsipacsi) [21:47:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [21:47:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250102T2200) [22:19:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428241 (phaultfinder) [22:29:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428262 (phaultfinder) [22:39:05] SRE-swift-storage, Commons, MediaWiki-File-management: Some files uploaded on 2024-12-23 not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T382765#10428268 (mdaniels5757) 5 and 6 are better now too -- I just undeleted and they were fine (both viewing the file when deleted and...
[23:23:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428356 (phaultfinder) [23:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:26:40] SRE-swift-storage, Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802#10428360 (Aklapper) [23:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed