[00:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1156504
[00:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1156504 (owner: TrainBranchBot)
[00:08:55] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1251.eqiad.wmnet
[00:09:20] PROBLEM - MariaDB Replica Lag: s1 #page on db1251 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:13] o/
[00:11:33] this looks like db1251 paged because the SSD firmware update is in progress?
[00:11:47] per " SSD firmware update for db125[0-4]" above
[00:11:55] ladsgroup@cumin2002 upgrade-firmware (PID 788674) is awaiting input
[00:13:27] !incidents
[00:13:27] 6348 (ACKED) db1251 (paged)/MariaDB Replica Lag: s1 (paged)
[00:13:29] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1251.eqiad.wmnet
[00:13:34] acked .. moving on
[00:13:51] !log ladsgroup@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: Firmware update
[00:14:20] https://sal.toolforge.org/log/n8WKZpcBvg159pQr5C0f
[00:14:21] :/
[00:21:20] RECOVERY - MariaDB Replica Lag: s1 #page on db1251 is OK: OK slave_sql_lag Replication lag: 17.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:26:58] (PS1) Arlolra: Disable VipsScaler in group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759)
[00:28:54] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1156504 (owner: TrainBranchBot)
[00:29:56] SRE: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816 (Dzahn) NEW
[00:29:57] !log ladsgroup@cumin2002 START - Cookbook sre.mysql.pool db1251 gradually with 4 steps - Firmware update done
[00:30:05] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911545 (ops-monitoring-bot) Start pool of db1251 gradually with 4 steps - Firmware update done - ladsgroup@cumin2002
[00:30:36] SRE, Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#10911547 (Dzahn)
[00:30:54] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911549 (Ladsgroup)
[00:31:24] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911550 (Ladsgroup) a:Ladsgroup→None That leaves db1250 which requires a switchover. The rest is done.
[00:31:36] !log ladsgroup@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1253 gradually with 4 steps - Firmware updated
[00:31:36] (CR) Ladsgroup: [C:+1] Disable VipsScaler in group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: Arlolra)
[00:31:41] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911552 (ops-monitoring-bot) Completed pool of db1253 gradually with 4 steps - Firmware updated - ladsgroup@cumin2002
[00:32:10] SRE, Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#10911553 (Dzahn)
[00:32:56] SRE, DBA, Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#10911556 (Dzahn)
[00:46:38] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/828a2ad95042b805a1b95829245de910e82a547462223f3c6b1e5aac1eb17b43/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:06:38] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:14:21] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-ulsfo and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[01:14:25] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[01:17:56] !log ladsgroup@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1251 gradually with 4 steps - Firmware update done
[01:18:01] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911604 (ops-monitoring-bot) Completed pool of db1251 gradually with 4 steps - Firmware update done - ladsgroup@cumin2002
[01:21:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:36] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:24:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:25:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:30:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:09] (CR) Bartosz Dziewoński: "It'll need to be rewritten into PHP, I guess, but that's probably a good idea anyway. This was really just a prototype I built for myself," [puppet] - https://gerrit.wikimedia.org/r/1155892 (owner: Bartosz Dziewoński)
[01:37:23] (Abandoned) Bartosz Dziewoński: tables-catalog: Add a script to visualize it as a table [puppet] - https://gerrit.wikimedia.org/r/1155892 (owner: Bartosz Dziewoński)
[02:33:47] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:40:02] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:41:16] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:57:36] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:57:43] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet']
[02:57:45] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet']
[02:57:54] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:58:08] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1017.eqiad.wmnet']
[02:59:32] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1017.eqiad.wmnet with OS bullseye
[03:01:44] (CR) Andrew Bogott: [C:+2] cloudcephosd1018 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156305 (https://phabricator.wikimedia.org/T309789) (owner: Andrew Bogott)
[03:02:06] (CR) Andrew Bogott: [C:+2] cloudcephosd1017 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156304 (https://phabricator.wikimedia.org/T309789) (owner: Andrew Bogott)
[03:02:18] (CR) Andrew Bogott: [C:+2] Update cloudcephosd1017 with probably new nic names [puppet] - https://gerrit.wikimedia.org/r/1156444 (owner: Andrew Bogott)
[03:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:15:00] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage
[03:18:30] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage
[03:20:32] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:23:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:28:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:35:38] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1017.eqiad.wmnet with OS bullseye
[03:35:40] (PS5) Krinkle: multiversion: Remove unused newFromDBName() [mediawiki-config] - https://gerrit.wikimedia.org/r/1154139
[04:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.076s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:25:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.191s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.039s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:01:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:01:48] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:01:58] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:08] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[05:18:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb1013.eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:21:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[05:21:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T396130)', diff saved to https://phabricator.wikimedia.org/P77881 and previous config saved to /var/cache/conftool/dbconfig/20250613-052114-marostegui.json
[05:21:18] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:24:04] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10911834 (Marostegui) Open→Resolved a:Jhancock.wm Everything looks good! Thank you ` communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enc...
[05:30:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:36:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T396130)', diff saved to https://phabricator.wikimedia.org/P77882 and previous config saved to /var/cache/conftool/dbconfig/20250613-053617-marostegui.json
[05:36:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:40:25] (PS1) Marostegui: db2189: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1156636 (https://phabricator.wikimedia.org/T396549)
[05:40:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2189.codfw.wmnet with reason: Maintenance
[05:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2189', diff saved to https://phabricator.wikimedia.org/P77883 and previous config saved to /var/cache/conftool/dbconfig/20250613-054156-marostegui.json
[05:44:26] (CR) Marostegui: [C:+2] db2189: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1156636 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:49:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77884 and previous config saved to /var/cache/conftool/dbconfig/20250613-054918-root.json
[05:51:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P77885 and previous config saved to /var/cache/conftool/dbconfig/20250613-055125-marostegui.json
[05:53:40] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10911867 (tstarling) Per my comments at T54647, the SEO problem with the current situation seems to be...
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250613T0600)
[06:04:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77887 and previous config saved to /var/cache/conftool/dbconfig/20250613-060424-root.json
[06:06:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P77888 and previous config saved to /var/cache/conftool/dbconfig/20250613-060633-marostegui.json
[06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:14:58] PROBLEM - Hadoop DataNode on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:15:10] PROBLEM - Hadoop DataNode on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:15:16] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:15:16] PROBLEM - Hadoop DataNode on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:15:16] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:15:32] PROBLEM - Hadoop DataNode on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:15:32] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:15:34] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:15:48] PROBLEM - Hadoop DataNode on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:15:48] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:16:16] PROBLEM - Hadoop DataNode on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:16:22] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:19:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77889 and previous config saved to /var/cache/conftool/dbconfig/20250613-061930-root.json
[06:21:38] (PS2) Muehlenhoff: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1156191
[06:21:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T396130)', diff saved to https://phabricator.wikimedia.org/P77890 and previous config saved to /var/cache/conftool/dbconfig/20250613-062140-marostegui.json
[06:21:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:21:46] (CR) Muehlenhoff: "Ack, done" [puppet] - https://gerrit.wikimedia.org/r/1156191 (owner: Muehlenhoff)
[06:21:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:22:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T396130)', diff saved to https://phabricator.wikimedia.org/P77891 and previous config saved to /var/cache/conftool/dbconfig/20250613-062203-marostegui.json
[06:26:54] (PS1) Muehlenhoff: Don't auto-restart atftpd on Bookworm and later [puppet] - https://gerrit.wikimedia.org/r/1156648
[06:34:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77892 and previous config saved to /var/cache/conftool/dbconfig/20250613-063435-root.json
[06:35:43] (PS1) Hashar: tox: pin mypy<1.16.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1156651
[06:38:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T396130)', diff saved to https://phabricator.wikimedia.org/P77893 and previous config saved to /var/cache/conftool/dbconfig/20250613-063845-marostegui.json
[06:38:50] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:52:27] (PS1) Marostegui: db2175: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1156654 (https://phabricator.wikimedia.org/T396549)
[06:52:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2175', diff saved to https://phabricator.wikimedia.org/P77894 and previous config saved to /var/cache/conftool/dbconfig/20250613-065239-marostegui.json
[06:53:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2175.codfw.wmnet with reason: Maintenance
[06:53:46] (CR) Marostegui: [C:+2] db2175: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1156654 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[06:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P77895 and previous config saved to /var/cache/conftool/dbconfig/20250613-065353-marostegui.json
[06:56:21] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10911914 (MoritzMuehlenhoff)
[06:58:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2175.codfw.wmnet with reason: Maintenance
[06:59:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77896 and previous config saved to /var/cache/conftool/dbconfig/20250613-065933-root.json
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250613T0700)
[07:00:20] (PS1) Muehlenhoff: memcached/gutter: Switch to firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156659
[07:00:38] (CR) Marostegui: "Are these tables on wikireplicas? Because if they are added to private lists, we'll get the alerts and we'll have to run redact_sanitarium" [puppet] - https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[07:02:44] SRE, DBA, Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#10911926 (Marostegui) The problem is a bit deeper as one host shouldn't page, but all of them should. eg: massive amount of writes on a master, that won't trigg...
[07:03:14] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#10911928 (10Marostegui) p:05Triage→03Medium [07:04:19] (03CR) 10Ayounsi: [C:03+1] Don't auto-restart atftpd on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/1156648 (owner: 10Muehlenhoff) [07:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P77897 and previous config saved to /var/cache/conftool/dbconfig/20250613-070901-marostegui.json [07:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:13:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156659 (owner: 10Muehlenhoff) [07:14:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77898 and previous config saved to /var/cache/conftool/dbconfig/20250613-071438-root.json [07:16:06] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet [07:16:08] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [07:16:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:16:53] 10ops-magru: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390258#10911945 (10ayounsi) 05Open→03Invalid Looking at Mar 28 2025, there seems like there was some small events, but 
nothing worth investigating, we can close that for now. [07:17:52] (03PS2) 10Muehlenhoff: memcached/gutter: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156659 [07:17:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156659 (owner: 10Muehlenhoff) [07:18:46] !log jmm@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [07:20:32] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:21:40] (03CR) 10Jelto: "I left some comments in-line. Also don't forget to bump the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [07:21:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:21:46] jmm@cumin1003 makevm (PID 1391881) is awaiting input [07:23:54] (03PS1) 10Muehlenhoff: mediawiki/memcached: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156669 [07:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T396130)', diff saved to https://phabricator.wikimedia.org/P77899 and previous config saved to /var/cache/conftool/dbconfig/20250613-072408-marostegui.json [07:24:13] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:24:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:24:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T396130)', diff saved to https://phabricator.wikimedia.org/P77900 and previous config saved to /var/cache/conftool/dbconfig/20250613-072431-marostegui.json [07:25:40] 06SRE, 06Infrastructure-Foundations, 10netbox: Traceback in sre.dns.netbox - https://phabricator.wikimedia.org/T396834 (10MoritzMuehlenhoff) 03NEW [07:26:17] 06SRE, 06Infrastructure-Foundations, 10netbox: Traceback in sre.dns.netbox accessing a virtual interface - https://phabricator.wikimedia.org/T396834#10911964 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:26:29] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [07:26:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:26:48] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:29:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77901 and previous config saved to /var/cache/conftool/dbconfig/20250613-072944-root.json [07:31:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:31:43] FIRING: [18x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:08] jmm@cumin1003 makevm (PID 1391881) is awaiting input [07:33:00] 06SRE, 06Infrastructure-Foundations, 10netbox: Traceback in sre.dns.netbox accessing a virtual interface - https://phabricator.wikimedia.org/T396834#10911971 (10ayounsi) This matches this old IP deletion change: https://netbox.wikimedia.org/extras/changelog/228764/ `assigned_object_id: 3934` My guess is that... [07:33:47] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003" [07:34:56] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003" [07:34:56] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:34:56] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors [07:34:59] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors [07:35:28] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003" [07:35:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003" [07:36:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:36:48] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:39:50] jmm@cumin1003 makevm (PID 1391881) is awaiting input [07:41:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T396130)', diff saved to https://phabricator.wikimedia.org/P77903 and previous config saved to /var/cache/conftool/dbconfig/20250613-074110-marostegui.json [07:41:16] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:41:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:31] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7004.magru.wmnet with OS bookworm [07:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77904 and previous config saved to /var/cache/conftool/dbconfig/20250613-074450-root.json [07:46:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:46:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10911993 (10Anton.Kokh) Hi there, It's Anton Kokh, anton.kokh@wikimedia.de [07:51:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:51:48] FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:56:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156669 (owner: 10Muehlenhoff) [07:56:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P77905 and previous config saved to /var/cache/conftool/dbconfig/20250613-075618-marostegui.json [07:56:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:56:43] FIRING: [11x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:56:58] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:01:43] FIRING: [9x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:03:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10912010 (10elukey) @KFrancis Hi! The email is anton.kokh@wikimedia.de :) [08:05:48] (03PS1) 10Filippo Giunchedi: hieradata: restrict titan memcached access [puppet] - 10https://gerrit.wikimedia.org/r/1156728 (https://phabricator.wikimedia.org/T394319) [08:06:43] RESOLVED: [6x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:08:43] (03CR) 10JMeybohm: [C:03+2] calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [08:08:50] (03CR) 10JMeybohm: [C:03+2] coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) (owner: 
10JMeybohm) [08:10:12] (03PS3) 10Hashar: Change log format to get name of image being built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 [08:11:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P77908 and previous config saved to /var/cache/conftool/dbconfig/20250613-081126-marostegui.json [08:14:52] (03Merged) 10jenkins-bot: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [08:15:39] (03Merged) 10jenkins-bot: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [08:19:40] (03CR) 10Clément Goubert: [C:03+1] zarcillo: Allow egress to idp.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156401 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [08:23:24] (03CR) 10Tacsipacsi: "> We forgot about it 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) (owner: 10Zabe) [08:25:48] (03PS1) 10Hashar: Stream build lines as individual logs [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 [08:26:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T396130)', diff saved to https://phabricator.wikimedia.org/P77909 and previous config saved to /var/cache/conftool/dbconfig/20250613-082633-marostegui.json [08:26:38] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:26:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2164.codfw.wmnet with reason: Maintenance [08:26:57] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depooling db2164 (T396130)', diff saved to https://phabricator.wikimedia.org/P77910 and previous config saved to /var/cache/conftool/dbconfig/20250613-082656-marostegui.json [08:32:44] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:35:53] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:39:08] (03PS2) 10Hashar: Stream build lines as individual log entries [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 [08:42:49] (03PS2) 10Svantje Lilienthal: Enable sub-referencing on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) [08:43:10] (03CR) 10CI reject: [V:04-1] Enable sub-referencing on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: 10Svantje Lilienthal) [08:43:11] (03CR) 10Elukey: [C:03+1] tox: pin mypy<1.16.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156651 (owner: 10Hashar) [08:43:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T396130)', diff saved to https://phabricator.wikimedia.org/P77911 and previous config saved to /var/cache/conftool/dbconfig/20250613-084325-marostegui.json [08:43:29] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:45:54] (03CR) 10Elukey: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [08:48:39] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10912217 (10Gehel) [08:48:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10912221 
(10Gehel) [08:48:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission relforge100[34] - https://phabricator.wikimedia.org/T390565#10912227 (10Gehel) [08:49:42] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir7004.magru.wmnet with OS bookworm [08:49:42] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7004.magru.wmnet [08:49:45] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10912253 (10Gehel) [08:49:51] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:50:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10912275 (10Gehel) [08:50:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10912277 (10Gehel) [08:52:01] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10912310 (10Gehel) [08:53:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10912330 (10Gehel) [08:53:09] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:53:56] 07sre-alert-triage, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10912358 (10Gehel) [08:54:42] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[08:54:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10912372 (10Gehel) [08:55:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10912375 (10Gehel) [08:55:22] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10912382 (10Gehel) [08:56:25] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:56:50] (03PS8) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [08:58:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P77912 and previous config saved to /var/cache/conftool/dbconfig/20250613-085832-marostegui.json [09:01:50] (03CR) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [09:04:32] PROBLEM - HTTP on install7001 is CRITICAL: connect to address 195.200.68.7 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers [09:04:34] PROBLEM - TFTP service on install7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [09:05:10] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install7001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:53] (03CR) 10AOkoth: miscweb: add os-reports update mechanism (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [09:07:26] FIRING: [2x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:00] (03CR) 10Brouberol: [C:03+2] "@french" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [09:10:33] (03PS8) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [09:12:28] (03CR) 10CI reject: [V:04-1] miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [09:12:43] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup1002.eqiad.wmnet with reason: Maintenance and reboot [09:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P77913 and previous config saved to /var/cache/conftool/dbconfig/20250613-091339-marostegui.json [09:15:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1156728 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [09:17:36] (03PS9) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 
(https://phabricator.wikimedia.org/T350794) [09:17:49] (03PS1) 10Marostegui: db2148: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156745 (https://phabricator.wikimedia.org/T396549) [09:18:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2148', diff saved to https://phabricator.wikimedia.org/P77914 and previous config saved to /var/cache/conftool/dbconfig/20250613-091800-marostegui.json [09:18:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:18:29] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156744 (https://phabricator.wikimedia.org/T396596) [09:18:32] (03CR) 10Marostegui: [C:03+2] db2148: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156745 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [09:20:01] (03PS2) 10Filippo Giunchedi: hieradata: restrict titan memcached access [puppet] - 10https://gerrit.wikimedia.org/r/1156728 (https://phabricator.wikimedia.org/T394319) [09:20:32] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:21:10] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: restrict titan memcached access [puppet] - 10https://gerrit.wikimedia.org/r/1156728 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [09:22:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77915 and previous config saved to /var/cache/conftool/dbconfig/20250613-092236-root.json [09:22:51] (03CR) 10Itamar Givon: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1156744 (https://phabricator.wikimedia.org/T396596) (owner: 10Jakob) [09:28:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T396130)', diff saved to https://phabricator.wikimedia.org/P77916 and previous config saved to /var/cache/conftool/dbconfig/20250613-092847-marostegui.json [09:28:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:29:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2165.codfw.wmnet with reason: Maintenance [09:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77917 and previous config saved to /var/cache/conftool/dbconfig/20250613-092910-marostegui.json [09:31:04] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156744 (https://phabricator.wikimedia.org/T396596) (owner: 10Jakob) [09:32:48] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156744 (https://phabricator.wikimedia.org/T396596) (owner: 10Jakob) [09:34:00] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:34:11] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:34:52] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [09:35:13] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [09:35:27] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [09:35:47] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [09:35:52] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 7 days, 0:00:00 on install7001.wikimedia.org with reason: being replaced by install7002 [09:37:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77918 and previous config saved to /var/cache/conftool/dbconfig/20250613-093742-root.json [09:39:32] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:39:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:39:56] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10912563 (10elukey) @RKemper Hi! Thanks a lot for the follow up. One question - I noticed that https://wikitech.wikimedia.org/wiki/SLO/W... 
[09:40:17] RESOLVED: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:42:04] (03CR) 10JMeybohm: admin_ng: define a priority class optional environment feature (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [09:45:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77919 and previous config saved to /var/cache/conftool/dbconfig/20250613-094552-marostegui.json [09:45:56] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:47:29] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup1001.eqiad.wmnet with reason: Maintenance and reboot [09:49:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:52:25] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77920 and previous config saved to /var/cache/conftool/dbconfig/20250613-095248-root.json [09:56:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:57:39] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: Allow egress to idp.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156401 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [09:59:33] (03PS5) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [09:59:33] (03PS3) 10Alexandros Kosiaris: docker_registry: Move rsyslog rules from init to web.pp [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) [09:59:33] (03PS3) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [10:01:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P77921 and previous config saved to /var/cache/conftool/dbconfig/20250613-100059-marostegui.json [10:01:48] (03PS1) 10Btullis: Fix the connector name for the prometheus presto catalog [puppet] - 10https://gerrit.wikimedia.org/r/1156760 (https://phabricator.wikimedia.org/T347430) [10:01:53] (03PS1) 10Alexandros Kosiaris: registry: Add hiera for the new hierarchy [labs/private] 
- 10https://gerrit.wikimedia.org/r/1156761 (https://phabricator.wikimedia.org/T390251) [10:01:54] (03PS1) 10Alexandros Kosiaris: Remove old docker_registry_ha hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/1156762 (https://phabricator.wikimedia.org/T390251) [10:02:00] (03CR) 10CI reject: [V:04-1] docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:02:06] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [10:03:08] (03PS3) 10Svantje Lilienthal: Enable sub-referencing on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) [10:03:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:28] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add codfw1dev auth v6 VIPs - taavi@cumin1003" [10:05:32] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add codfw1dev auth v6 VIPs - taavi@cumin1003" [10:05:32] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:07:19] (03PS3) 10Btullis: Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) [10:07:29] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] registry: Add hiera for the new hierarchy [labs/private] - 10https://gerrit.wikimedia.org/r/1156761 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:07:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling 
@ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77922 and previous config saved to /var/cache/conftool/dbconfig/20250613-100754-root.json [10:08:00] (03PS2) 10Btullis: Fix the connector name for the prometheus presto catalog [puppet] - 10https://gerrit.wikimedia.org/r/1156760 (https://phabricator.wikimedia.org/T347430) [10:08:27] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:08:51] (03CR) 10Btullis: [C:03+2] Fix the connector name for the prometheus presto catalog [puppet] - 10https://gerrit.wikimedia.org/r/1156760 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [10:08:58] (03CR) 10CI reject: [V:04-1] Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [10:11:08] (03CR) 10Hnowlan: [C:04-1] "Looks okay in terms of approach, changes needed to move yaml anchors around." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:11:37] (03PS9) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [10:12:36] (03PS1) 10Alexandros Kosiaris: pontoon: Add stack registry [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) [10:13:15] (03CR) 10CI reject: [V:04-1] pontoon: Add stack registry [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:13:42] (03CR) 10JMeybohm: admin_ng: define a priority class optional environment feature (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [10:14:53] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, config.yaml is missing SPDX and I'll send a followup patch to fix it in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:14:58] (03PS2) 10Alexandros Kosiaris: pontoon: Add stack registry [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) [10:15:31] (03CR) 10Alexandros Kosiaris: "Pontoon test was successful, change for adding the stack is at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156767" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:15:33] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:15:37] (03PS10) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 
(https://phabricator.wikimedia.org/T395107) [10:16:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P77923 and previous config saved to /var/cache/conftool/dbconfig/20250613-101607-marostegui.json [10:16:24] (03PS11) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [10:16:57] (03PS12) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [10:18:46] (03PS13) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [10:21:15] (03PS1) 10Muehlenhoff: Record LDAP access for slong [puppet] - 10https://gerrit.wikimedia.org/r/1156772 [10:23:25] (03CR) 10Filippo Giunchedi: [C:03+1] pontoon: Add stack registry [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:23:29] 10ops-codfw, 06DC-Ops: db2212 not powering up - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T396852 (10FCeratto-WMF) 03NEW [10:23:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on db2212.codfw.wmnet with reason: Not powering up [10:23:58] 10ops-codfw, 06DC-Ops: db2212 not powering up - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T396852#10912709 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0808ecb5-85df-4c4f-adf2-cca3d3eb7bd1) set by fceratto@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services... 
[10:25:51] (03CR) 10Jelto: miscweb: add os-reports update mechanism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:25:58] (03CR) 10JMeybohm: [C:03+1] admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [10:26:50] (03PS14) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [10:27:44] (03CR) 10Muehlenhoff: "The patch won't hurt, but is it really needed, did you encounter it in the wild? Every one of those lua module packages depends on the cur" [puppet] - 10https://gerrit.wikimedia.org/r/1155160 (owner: 10Kamila Součková) [10:27:52] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for slong [puppet] - 10https://gerrit.wikimedia.org/r/1156772 (owner: 10Muehlenhoff) [10:29:19] (03CR) 10Btullis: [C:03+1] admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [10:31:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77924 and previous config saved to /var/cache/conftool/dbconfig/20250613-103114-marostegui.json [10:31:19] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:31:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:31:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77925 and previous config saved to 
/var/cache/conftool/dbconfig/20250613-103137-marostegui.json [10:33:29] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T396854 (10Anipedia) 03NEW [10:35:29] (03CR) 10WMDE-Fisch: [C:03+1] Enable sub-referencing on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: 10Svantje Lilienthal) [10:35:53] 06SRE, 10SRE-Access-Requests: Requesting access to for Anipedia - https://phabricator.wikimedia.org/T396854#10912760 (10Anipedia) [10:37:33] 06SRE, 10SRE-Access-Requests: Requesting access to for Anipedia - https://phabricator.wikimedia.org/T396854#10912762 (10Anipedia) I want create this Santali News Website for Learning [10:38:25] (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to page/title (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156396 (owner: 10Jgiannelos) [10:39:57] (03CR) 10Elukey: [C:03+1] docker_registry: Refactor to allow >1 instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:42:51] (03PS1) 10Filippo Giunchedi: pontoon: write SPDX header to stack config on save [puppet] - 10https://gerrit.wikimedia.org/r/1156781 [10:45:46] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup1001.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [10:45:53] (03PS4) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [10:46:13] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:48:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77926 and 
previous config saved to /var/cache/conftool/dbconfig/20250613-104816-marostegui.json [10:48:21] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:48:24] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup1002.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [10:48:53] (03PS1) 10Btullis: Update the connection parameters for presto->prometheus in test [puppet] - 10https://gerrit.wikimedia.org/r/1156786 (https://phabricator.wikimedia.org/T347430) [10:55:55] (03CR) 10Alexandros Kosiaris: [C:03+2] pontoon: Add stack registry [puppet] - 10https://gerrit.wikimedia.org/r/1156767 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [10:58:49] (03CR) 10Alexandros Kosiaris: "PCC got 1 single worrying diff" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250613T0700) [11:00:06] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250613T1100). 
[11:03:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P77927 and previous config saved to /var/cache/conftool/dbconfig/20250613-110324-marostegui.json [11:05:40] (03PS6) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [11:05:40] (03PS4) 10Alexandros Kosiaris: docker_registry: Move rsyslog rules from init to web.pp [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) [11:05:40] (03PS5) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [11:06:01] (03PS6) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [11:09:02] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:14:41] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7004.magru.wmnet with OS bookworm [11:17:16] (03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:17:26] (03CR) 10Btullis: [C:03+2] Update the connection parameters for presto->prometheus in test [puppet] - 10https://gerrit.wikimedia.org/r/1156786 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [11:17:30] (03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry: Move rsyslog rules from init to web.pp [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:17:40] 
(03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:18:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P77928 and previous config saved to /var/cache/conftool/dbconfig/20250613-111832-marostegui.json [11:22:30] !log marostegui@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:22:39] Morning [11:22:43] (03PS3) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) [11:23:10] I seem to be getting 429 errors when trying to reach Wikimedia's Phabricator? [11:23:43] The traceroute says I am trying to reach this via esams (Amsterdam).. [11:23:53] Is this a known issue? [11:23:54] (03CR) 10Jgiannelos: "I updated the patch and completely remove the anchors and their references which led to removing a complete section of rules that was not " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [11:24:47] ShakespeareFan00: 429 errors happen when you get rate limited. is it possible that you or someone on your same ip has been doing a lot of hits to phabricator? [11:25:51] I can't think of anything [11:26:20] My attempts to view Phabricator have been normal usage [11:27:02] If there is an overload, it's not coming from here [11:27:23] as far as I can tell. [11:28:05] If it's a specific ISP/router, that's not in my control.
[11:29:14] (03PS3) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [11:29:38] The specific error text was "Request served via cp3068 cp3068, Varnish XID 58148725 [11:29:38] Upstream caches: cp3068 int [11:29:38] Error: 429, at Fri, 13 Jun 2025 11:29:09 GMT" [11:30:14] As you say, something is possibly making a lot of requests, but it isn't me as far as I can tell. [11:31:35] It's very frustrating when you can't even reach the bug tracker. [11:31:44] (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [11:32:15] (03PS4) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) [11:32:24] @jynus: Can you help me narrow down where the overload might be? [11:32:42] (03Abandoned) 10Jgiannelos: changeprop: Remove rules related to page/title (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156396 (owner: 10Jgiannelos) [11:32:55] (because I don't think it's an issue on the local machine.)
[11:33:24] (03CR) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [11:33:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77929 and previous config saved to /var/cache/conftool/dbconfig/20250613-113339-marostegui.json [11:33:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:33:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [11:34:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77930 and previous config saved to /var/cache/conftool/dbconfig/20250613-113402-marostegui.json [11:34:05] ShakespeareFan00: I am afraid I cannot help much, phabricator works for me and for other users [11:36:52] my suggestion is, as you cannot open a ticket, send an email following: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [11:41:47] !log T390251 re-enable puppet on registry1004 after merging puppet refactoring changes. 
[11:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:51] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [11:43:10] PROBLEM - Docker registry health on registry1004 is CRITICAL: connect to address 10.64.32.143 and port 5001: Connection refused https://wikitech.wikimedia.org/wiki/Docker [11:43:19] (03CR) 10Brouberol: [C:03+2] admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [11:43:30] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string schemaVersion not found on https://registry1004.eqiad.wmnet:443/v2/bullseye/manifests/latest - 364 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Docker [11:43:37] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7004.magru.wmnet with reason: host reimage [11:44:06] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:30] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Docker [11:45:10] RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [11:45:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:45:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:46:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7004.magru.wmnet with reason: host reimage [11:47:26] RESOLVED: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:32] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1018.eqiad.wmnet'] [11:48:25] RESOLVED: [2x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:37] (03PS1) 10Alexandros Kosiaris: docker_registry: Make sure ports are in the right format [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) [11:49:04] (03PS1) 10Marostegui: db1182: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156810 (https://phabricator.wikimedia.org/T396549) [11:49:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P77931 and previous config saved to /var/cache/conftool/dbconfig/20250613-114917-marostegui.json [11:49:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:49:50] (03CR) 10Marostegui: [C:03+2] db1182: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156810 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [11:50:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77932 and 
previous config saved to /var/cache/conftool/dbconfig/20250613-115049-marostegui.json [11:50:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:53:23] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:53:40] (03PS1) 10Hnowlan: rest-gateway: route html<->wikitext transforms to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) [11:54:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1018.eqiad.wmnet'] [11:54:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77933 and previous config saved to /var/cache/conftool/dbconfig/20250613-115438-root.json [11:54:46] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1018.eqiad.wmnet'] [11:55:14] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1018.eqiad.wmnet'] [11:55:30] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1018.eqiad.wmnet [11:56:19] (03PS2) 10Alexandros Kosiaris: docker_registry: Make sure ports are in the right format [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) [11:56:28] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [11:56:37] (03PS1) 10Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156812 
(https://phabricator.wikimedia.org/T395107) [11:57:39] (03PS2) 10Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) [11:58:17] (03PS1) 10Hnowlan: trafficserver: migrate html<->wikitext transforms out of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1156813 (https://phabricator.wikimedia.org/T396856) [11:59:44] (03CR) 10Clément Goubert: [C:03+1] docker_registry: Make sure ports are in the right format [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [12:02:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7004.magru.wmnet with OS bookworm [12:05:31] !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1018.eqiad.wmnet [12:05:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P77934 and previous config saved to /var/cache/conftool/dbconfig/20250613-120557-marostegui.json [12:06:02] (03PS1) 10Muehlenhoff: Apply ncredir role to ncredir7004 [puppet] - 10https://gerrit.wikimedia.org/r/1156814 (https://phabricator.wikimedia.org/T394263) [12:06:04] (03PS1) 10Muehlenhoff: Add ncredir7004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1156815 (https://phabricator.wikimedia.org/T394263) [12:09:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77935 and previous config saved to /var/cache/conftool/dbconfig/20250613-120944-root.json [12:12:52] (03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry: Make sure ports are in the right format [puppet] - 10https://gerrit.wikimedia.org/r/1156809 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [12:15:54] !log andrew@cumin1002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1018.eqiad.wmnet [12:15:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cloudcephosd1018.eqiad.wmnet [12:16:21] (03PS2) 10Andrew Bogott: Update cloudcephosd1018 with probable new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156445 [12:16:31] (03PS2) 10Andrew Bogott: Update cloudcephosd1019 with probable new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156446 [12:17:58] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1018.eqiad.wmnet with OS bullseye [12:18:12] (03CR) 10Andrew Bogott: [C:03+2] Update cloudcephosd1018 with probable new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156445 (owner: 10Andrew Bogott) [12:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P77936 and previous config saved to /var/cache/conftool/dbconfig/20250613-122104-marostegui.json [12:21:19] !log T390251 re-enable puppet on all registries. [12:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:23] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [12:23:12] (03CR) 10Ilias Sarantopoulos: "This is not true, I accidentally scheduled this for deployment so please disregard." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [12:23:25] (03CR) 10Jgiannelos: [C:03+1] "Looks OK. Can we merge this one first so we can test a bit the endpoints before switching over traffic? 
We do have some tests here:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan) [12:24:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77937 and previous config saved to /var/cache/conftool/dbconfig/20250613-122449-root.json [12:26:10] andrew@cumin1002 reimage (PID 1778950) is awaiting input [12:27:59] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1018.eqiad.wmnet with OS bullseye [12:28:32] (03CR) 10Vgutierrez: [C:03+1] Apply ncredir role to ncredir7004 [puppet] - 10https://gerrit.wikimedia.org/r/1156814 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:28:46] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1018.eqiad.wmnet with OS bullseye [12:36:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77938 and previous config saved to /var/cache/conftool/dbconfig/20250613-123612-marostegui.json [12:36:17] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:36:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2181.codfw.wmnet with reason: Maintenance [12:36:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T396130)', diff saved to https://phabricator.wikimedia.org/P77939 and previous config saved to /var/cache/conftool/dbconfig/20250613-123635-marostegui.json [12:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77940 and previous config saved to /var/cache/conftool/dbconfig/20250613-123955-root.json [12:40:59] jouncebot: nowandnext [12:40:59] For the next 18 
hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250613T0700) [12:40:59] In 18 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250614T0700) [12:41:06] oh yeah, Friday [12:42:15] :D [12:42:30] TheresNoTime: can I bash that? :P [12:42:37] yes x3 [12:43:10] proposed new rule: deploys can be done as normal on Fridays that happen to be the 13th day of the month, for maximum chaos [12:43:27] the rare self-dating quip (timestamp in the wikitech URL hash) https://bash.toolforge.org/quip/1-VQaZcB8tZ8Ohr0vssG [12:43:27] +1 [12:43:43] !trout taavi [12:44:07] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [12:44:09] what [12:44:35] I guess there’ll be at least one deploy today anyway, once the train is unblocked [12:44:43] (03CR) 10Hashar: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [12:45:05] (03PS4) 10Hashar: Change log format to get name of image being built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 [12:45:05] (03PS3) 10Hashar: Stream build lines as individual logs [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 [12:46:49] (03CR) 10Btullis: [C:03+1] "Looks good to me, but can we deploy next week, just to be on the safe side?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [12:47:31] (03CR) 10Brouberol: "Sure, no problem!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [12:48:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [12:53:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T396130)', diff saved to https://phabricator.wikimedia.org/P77941 and previous config saved to /var/cache/conftool/dbconfig/20250613-125314-marostegui.json [12:53:19] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:53:35] (03PS1) 10Brouberol: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) [12:53:52] (03CR) 10Kamila Součková: "I did encounter it, but turns out apt was unhappy on the VM, so actually probably not needed. Thanks for checking!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155160 (owner: 10Kamila Součková) [12:54:05] (03Abandoned) 10Kamila Součková: modules/nginx: install extra modules after main nginx package [puppet] - 10https://gerrit.wikimedia.org/r/1155160 (owner: 10Kamila Součková) [12:54:29] (03PS1) 10Gmodena: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) [12:59:15] (03PS2) 10Gmodena: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) [13:02:40] (03PS1) 10Btullis: Presto: Add a prometheus connector pointing to thanos [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) [13:03:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5964/co" [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [13:05:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1018.eqiad.wmnet with OS bullseye [13:08:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P77942 and previous config saved to /var/cache/conftool/dbconfig/20250613-130822-marostegui.json [13:15:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:17:15] @jynus: Well, e-mail sent, but the addressees will not be back until August.. [13:17:28] Thank you for the assistance you were able to provide.
[13:20:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:20:45] (03CR) 10Filippo Giunchedi: "Please see my question at https://phabricator.wikimedia.org/T347430#10908972 and related: how much risk there is of big heavy analytical q" [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [13:23:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P77944 and previous config saved to /var/cache/conftool/dbconfig/20250613-132329-marostegui.json [13:32:30] (03PS1) 10Alexandros Kosiaris: docker_registry: Instantiate APUs s3 backend instance [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) [13:34:43] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [13:36:05] (03PS1) 10Brouberol: mediawiki-dumps-legacy: allow the airflow service account to query CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156830 (https://phabricator.wikimedia.org/T389786) [13:36:06] (03PS1) 10Brouberol: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) [13:36:08] (03PS1) 10Brouberol: mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) [13:38:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T396130)', diff saved to https://phabricator.wikimedia.org/P77947 and previous config saved to 
/var/cache/conftool/dbconfig/20250613-133837-marostegui.json [13:38:42] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:38:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2195.codfw.wmnet with reason: Maintenance [13:39:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T396130)', diff saved to https://phabricator.wikimedia.org/P77948 and previous config saved to /var/cache/conftool/dbconfig/20250613-133900-marostegui.json [13:42:30] (03PS2) 10Alexandros Kosiaris: docker_registry: Instantiate APUs s3 backend instance [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) [13:42:30] (03PS1) 10Alexandros Kosiaris: docker_registry: Pass defaults to 2 option parameters [puppet] - 10https://gerrit.wikimedia.org/r/1156835 (https://phabricator.wikimedia.org/T390251) [13:43:35] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [13:45:18] (03CR) 10Xcollazo: [C:03+1] dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [13:47:38] go PixDeVl [13:47:38] (03PS4) 10Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [13:47:42] heh nope [13:48:14] (03PS4) 10JMeybohm: kind.sh can bootstrap a wikikube like cluster with kind [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) [13:48:38] (03CR) 10Aqu: [C:03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [13:48:41] (03CR) 10Alexandros Kosiaris: [C:03+1] "PCC works out, I 've created the bucket already, let's see!" [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [13:49:41] (03CR) 10CI reject: [V:04-1] New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:51:40] (03PS5) 10Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [13:53:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T396130)', diff saved to https://phabricator.wikimedia.org/P77949 and previous config saved to /var/cache/conftool/dbconfig/20250613-135336-marostegui.json [13:53:41] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:54:18] (03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry: Instantiate APUs s3 backend instance [puppet] - 10https://gerrit.wikimedia.org/r/1156829 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [13:54:26] (03CR) 10Alexandros Kosiaris: [C:03+2] docker_registry: Pass defaults to 2 option parameters [puppet] - 10https://gerrit.wikimedia.org/r/1156835 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [13:55:10] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:01] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk 
space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10913214 (10Scott_French) Now that the new alert has been live for a couple of days without issuing false... [13:59:22] (03PS1) 10Scott French: alertmanager: update data-persistence-task phid [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) [13:59:31] (03PS1) 10Scott French: sessionstore-resources: move SessionStoreDiskSpaceRunwayTooLow to task [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) [14:00:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:00:53] (03PS6) 10Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [14:01:42] (03PS1) 10Brouberol: airflow: tweak the environment variable exposing whether we're in dev/prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156843 (https://phabricator.wikimedia.org/T394297) [14:05:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:06:05] (03CR) 10Mforns: [C:03+1] "LGTM! 
Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156843 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [14:06:28] (03CR) 10Herron: [C:03+1] thanos: add memcached-based index caching to store [puppet] - 10https://gerrit.wikimedia.org/r/1156341 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [14:06:32] (03CR) 10Jforrester: [C:03+1] Disable VipsScaler in group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [14:06:58] (03CR) 10Herron: [C:03+1] thanos: trial store memcache on titan[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/1156342 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [14:07:20] (03CR) 10Herron: [C:03+1] thanos: activate store memcached across the board [puppet] - 10https://gerrit.wikimedia.org/r/1156343 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [14:08:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P77950 and previous config saved to /var/cache/conftool/dbconfig/20250613-140844-marostegui.json [14:09:19] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393785#10913256 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:10:10] FIRING: [3x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:53] (03CR) 10Brouberol: [C:03+2] airflow: tweak the environment variable exposing whether we're in dev/prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156843 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [14:16:58] (03CR) 10Eevans: [C:03+1] 
"LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:17:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:17:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:18:01] (03CR) 10Eevans: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:23:12] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@cab8d81]: hotfix-bump SEAL to v0.9.0 [14:23:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P77951 and previous config saved to /var/cache/conftool/dbconfig/20250613-142351-marostegui.json [14:25:07] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@cab8d81]: hotfix-bump SEAL to v0.9.0 (duration: 02m 26s) [14:29:22] 10ops-codfw, 06SRE, 06DC-Ops: db2212 not powering up - Comm Error: Backplane 0. - https://phabricator.wikimedia.org/T396852#10913313 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Marostegui reseated all the connections to the backplane. passes post and pings now. No alerts in the idrac. If it... [14:30:12] (03CR) 10Scott French: [C:03+1] "Thanks for separating this out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:37:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:37:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:38:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T396130)', diff saved to https://phabricator.wikimedia.org/P77952 and previous config saved to /var/cache/conftool/dbconfig/20250613-143859-marostegui.json [14:39:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:39:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [14:42:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:42:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:10] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10913462 (10akosiaris) Thanks, this helped. After creating manually the `registry-restricted` (docker-registry will return 503 if it doesn't exist) bucket with `s3c... 
[14:53:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:54:08] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:57:00] (03PS1) 10Eevans: cassandra: additional config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1156851 [14:58:37] 10ops-codfw, 06SRE, 06DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10913490 (10Jhancock.wm) double checking before i do something. Is this server depooled/drained? an idrac firmware upgrade does require a reboot. [15:02:45] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156851 (owner: 10Eevans) [15:03:51] PROBLEM - Disk space on stat1011 is CRITICAL: DISK CRITICAL - free space: / 2145 MB (3% inode=83%): /tmp 2145 MB (3% inode=83%): /var/tmp 2145 MB (3% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1011&var-datasource=eqiad+prometheus/ops [15:04:33] (03CR) 10Eevans: [C:03+2] cassandra: additional config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1156851 (owner: 10Eevans) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:07] (03PS10) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [15:12:34] (03CR) 10Scott French: "Thanks, Effie!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [15:13:02] (03CR) 10CI reject: [V:04-1] miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:15:12] (03CR) 10AOkoth: miscweb: add os-reports update mechanism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:18] (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [15:22:49] (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to parsoid (RB sunset) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [15:24:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:32] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10913611 (10Scott_French) a:03Scott_French [15:29:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 5.350 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:29:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:36:25] herron, brett, ChrisDobbins901_: noting that we're planning to roll the train forward shortly (has been blocked on T396790) in coordination with growth folks. [15:36:26] T396790: TypeError: GrowthExperiments\NewcomerTasks\ConfigurationLoader\CommunityConfigurationLoader::loadTaskTypes(): Return value must be of type array, MediaWiki\Extension\CommunityConfiguration\Validation\ValidationStatus returned - https://phabricator.wikimedia.org/T396790 [15:36:54] o/ [15:37:02] brennen: ack, thanks for the heads-up [15:49:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [15:50:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [15:51:45] (03PS1) 10Brennen Bearnes: Revert "group1 to 1.45.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156860 (https://phabricator.wikimedia.org/T392175) [15:51:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:57] (03PS1) 10Btullis: Bump the mediawiki-dumps-legacy toolbox image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156862 (https://phabricator.wikimedia.org/T394389) [15:52:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:52:27] sergi0, thcipriani: starting backport of the above revert [15:53:01] ack [15:53:09] thanks brennen [15:54:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156860 (https://phabricator.wikimedia.org/T392175) (owner: 
10Brennen Bearnes) [15:54:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:54:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.704 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:55:01] (03Merged) 10jenkins-bot: Revert "group1 to 1.45.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156860 (https://phabricator.wikimedia.org/T392175) (owner: 10Brennen Bearnes) [15:55:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:55:12] (03CR) 10Btullis: [C:03+2] Bump the mediawiki-dumps-legacy toolbox image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156862 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [15:55:27] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1156860|Revert "group1 to 1.45.0-wmf.5" (T392175 T396790)]] [15:55:32] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [15:55:33] T396790: TypeError: GrowthExperiments\NewcomerTasks\ConfigurationLoader\CommunityConfigurationLoader::loadTaskTypes(): Return value must be of type array, MediaWiki\Extension\CommunityConfiguration\Validation\ValidationStatus returned - https://phabricator.wikimedia.org/T396790 [15:55:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:47] (03Merged) 10jenkins-bot: Bump the mediawiki-dumps-legacy toolbox image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156862 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [15:57:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:59:25] !log brennen@deploy1003 brennen: Backport for [[gerrit:1156860|Revert "group1 to 1.45.0-wmf.5" 
(T392175 T396790)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:59:59] sergi0: over to you, let me know when you'd like to roll forward [16:00:16] last check of configs [16:00:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:01:29] let's do it [16:01:33] going [16:01:38] !log brennen@deploy1003 brennen: Continuing with sync [16:02:42] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363#10913798 (10Jhancock.wm) got this moved around as needed. Thank you both! (will close ticket at EoD for stashbot reasons) [16:05:10] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:06:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:06:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:08:34] (03PS1) 10Phuedx: ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) [16:09:20] 84%. 
[16:09:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx) [16:09:49] ack [16:10:24] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156860|Revert "group1 to 1.45.0-wmf.5" (T392175 T396790)]] (duration: 14m 56s) [16:10:30] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [16:10:30] T396790: TypeError: GrowthExperiments\NewcomerTasks\ConfigurationLoader\CommunityConfigurationLoader::loadTaskTypes(): Return value must be of type array, MediaWiki\Extension\CommunityConfiguration\Validation\ValidationStatus returned - https://phabricator.wikimedia.org/T396790 [16:10:34] there we go [16:11:19] I'm hitting Special:Homepage which loads the config and haven't seen occurrences yet [16:11:27] no explosion of TypeErrors, so it seems like we threaded the needle. :) [16:11:44] <3 brennen and sergi0 thanks both [16:12:16] thanks to you both @brennen and @thcipriani <3 [16:12:28] We'll retrospect and learn from this [16:12:35] (03CR) 10Santiago Faci: [C:03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx) [16:12:35] happy to help, thanks for the speedy response. [16:13:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:15:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10913880 (10Stevemunene) Thanks @VRiley-WMF moving on to re add the hosts to the cluster [16:23:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:33:53] !log dancy@deploy1003 Installing scap version "4.174.0" for 2 host(s) [16:35:43] !log dancy@deploy1003 Installation of scap version "4.174.0" completed for 2 hosts [16:36:11] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1157.eqiad.wmnet [16:38:06] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1157.eqiad.wmnet [16:40:10] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1158.eqiad.wmnet [16:42:22] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1158.eqiad.wmnet [16:42:43] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1159.eqiad.wmnet [16:45:34] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1159.eqiad.wmnet [16:47:37] RECOVERY - Hadoop DataNode on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [16:47:51] RECOVERY - Hadoop DataNode on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [16:48:15] RECOVERY - Hadoop DataNode on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [16:49:24] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1157.eqiad.wmnet [16:49:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - 
rack E5) - https://phabricator.wikimedia.org/T390175#10913988 (10ops-monitoring-bot) Host an-worker1157.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... [17:06:23] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:10:23] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:12:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:12:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:14:08] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:14:10] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:14:32] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:16:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:16:06] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:17:58] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:18:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:19:28] (03PS1) 10Cwhite: add notes on ECS pipeline behavior [software/ecs] - 10https://gerrit.wikimedia.org/r/1156898 (https://phabricator.wikimedia.org/T395819) [17:20:19] !log amastilovic@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:20:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:21:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:21:14] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:21:42] (03PS2) 10Cwhite: add notes on ECS pipeline behavior [software/ecs] - 10https://gerrit.wikimedia.org/r/1156898 (https://phabricator.wikimedia.org/T395819) [17:21:42] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:21:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:21:58] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:03] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:23] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:26] (03PS3) 10Cwhite: add notes on ECS pipeline behavior [software/ecs] - 10https://gerrit.wikimedia.org/r/1156898 (https://phabricator.wikimedia.org/T395819) [17:23:02] (03CR) 10Cwhite: [C:03+2] add notes on ECS pipeline behavior [software/ecs] - 10https://gerrit.wikimedia.org/r/1156898 (https://phabricator.wikimedia.org/T395819) (owner: 10Cwhite) [17:23:28] (03Merged) 10jenkins-bot: add notes on ECS pipeline behavior [software/ecs] - 10https://gerrit.wikimedia.org/r/1156898 (https://phabricator.wikimedia.org/T395819) (owner: 10Cwhite) [17:24:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:24:46] !log amastilovic@deploy2002 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:25:14] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:25:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:26:29] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:26:31] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:26:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:28:33] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:29:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:29:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:29:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:29:23] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:29:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:35:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10914081 (10RobH) [17:37:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10914082 (10RobH) [17:37:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10914084 (10RobH) >>! 
In T394499#10908935, @BTullis wrote: > Hi @RobH - Sorry, I'm not 100% clear on which host you would like me to proceed. The... [17:38:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10914085 (10RobH) [17:40:25] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:42:36] (03PS1) 10Eevans: cassandra: add local_system_data_file_directory to instance manifest [puppet] - 10https://gerrit.wikimedia.org/r/1156901 [17:43:14] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156901 (owner: 10Eevans) [17:44:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:45:39] (03CR) 10Eevans: [C:03+2] cassandra: add local_system_data_file_directory to instance manifest [puppet] - 10https://gerrit.wikimedia.org/r/1156901 (owner: 10Eevans) [17:54:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:59:32] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [17:59:33] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [17:59:34] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [18:00:04] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10914122 (10zolfeqar) Thanks a lot @Dzahn it works. 
[18:00:14] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10914123 (10Dzahn) 05Open→03Resolved a:03Dzahn [18:04:27] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10914130 (10Dzahn) Thanks to @jasmine_ for helping with the deployment. [18:10:10] FIRING: [3x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:31] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [18:14:19] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:14:21] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.006 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [18:14:42] wanted to restart that but .. 
ok then [18:17:19] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:20:44] (03PS1) 10Scott French: shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156442 (https://phabricator.wikimedia.org/T378128) [18:20:46] (03PS1) 10Scott French: shellbox-syntaxhighlight: migrate to bookworm-based httpd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156443 (https://phabricator.wikimedia.org/T378128) [18:34:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10914264 (10KFrancis) Thank you! I'll send you the NDA via Docusign today. 
[18:38:26] (03PS1) 10Eevans: convenience script to cleanup Cassandra instance state [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 [18:41:35] 06SRE, 10Observability-Alerting: monitoring ACKs should be delivered via SMS - https://phabricator.wikimedia.org/T396894 (10Dzahn) 03NEW [19:14:18] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1019.eqiad.wmnet [19:14:58] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1019 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156306 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott) [19:15:03] (03CR) 10Andrew Bogott: [C:03+2] Update cloudcephosd1019 with probable new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156446 (owner: 10Andrew Bogott) [19:17:38] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895 (10RobH) 03NEW [19:19:27] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914404 (10RobH) [19:21:38] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1019.eqiad.wmnet [19:23:13] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1019.eqiad.wmnet [19:24:47] !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1019.eqiad.wmnet [19:27:48] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914443 (10RobH) Support Email Draft: Support, When the power maintainance took place via CHG0247347, we lost power to the secondary feeds in our rack B... 
[19:35:27] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcephosd1019.eqiad.wmnet [19:35:28] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1019.eqiad.wmnet [19:35:37] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1019.eqiad.wmnet'] [19:38:29] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914499 (10RobH) CS1117758 filed [19:40:03] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [19:41:33] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1019.eqiad.wmnet'] [19:44:48] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1019.eqiad.wmnet with OS bullseye [19:50:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [20:00:23] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: host reimage [20:03:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:03:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: host reimage [20:04:03] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [20:05:10] FIRING: [3x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [20:10:55] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914579 (10RobH) Ticket accepted, changed from open to in progress. No further updates at this time. [20:11:04] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914580 (10RobH) [20:11:20] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914581 (10RobH) [20:12:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T396703#10914583 (10VRiley-WMF) Created SR 211385491 for drive replacment. 
[20:13:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [20:19:48] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1019.eqiad.wmnet with OS bullseye [20:20:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T395685#10914587 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This drive has been replaced [20:24:14] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:24:20] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1019.eqiad.wmnet with OS bullseye [20:34:23] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [20:40:18] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: host reimage [20:43:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1019.eqiad.wmnet with reason: host reimage [20:44:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission relforge100[34] - https://phabricator.wikimedia.org/T390565#10914624 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF These have been decomissioned [20:44:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission relforge100[34] - https://phabricator.wikimedia.org/T390565#10914627 (10VRiley-WMF) [20:58:12] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-magru loss of power redundancy - https://phabricator.wikimedia.org/T396895#10914653 (10RobH) 05Open→03Resolved a:03RobH Validated, the reported equipment is installed in U44, and the PSU on the equipment was Down. Reconn... 
[21:00:48] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1019.eqiad.wmnet with OS bullseye [21:24:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10914695 (10VRiley-WMF) [21:30:03] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:40:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:40:43] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:42:40] FIRING: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:48:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:48:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:49:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:49:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:51:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:52:40] RESOLVED: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus 
[21:52:40] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:53:25] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:59:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:59:39] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [21:59:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [21:59:56] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:00:12] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [22:00:27] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:00:43] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185 [22:00:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1185 [22:02:32] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:02:55] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:03:19] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:03:31] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [22:03:37] !log vriley@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [22:03:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:54] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [22:05:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914746 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [22:06:11] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:08:20] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:10:52] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:11:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:12:55] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [22:13:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914747 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... 
[22:14:19] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:14:25] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:14:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:17:25] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:18:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [22:18:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914751 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [22:19:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:57] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [22:24:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914759 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... 
[22:24:38] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:35] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [22:25:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914766 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [22:26:01] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [22:26:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914773 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... 
[22:42:12] vriley@cumin1002 reimage (PID 1853642) is awaiting input [22:42:44] vriley@cumin1002 reimage (PID 1853606) is awaiting input [23:15:24] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [23:15:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914823 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [23:16:03] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [23:16:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914824 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [23:22:03] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [23:22:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914825 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... 
[23:27:11] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [23:28:09] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [23:31:50] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [23:31:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1157016 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1157016 (owner: 10TrainBranchBot) [23:39:56] vriley@cumin1002 reimage (PID 1860447) is awaiting input [23:42:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [23:43:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... 
[23:50:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1157016 (owner: 10TrainBranchBot) [23:58:10] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [23:58:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10914854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b...