[00:00:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2061:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2061 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:19:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417748 (10phaultfinder) [00:38:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105822 [00:38:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105822 (owner: 10TrainBranchBot) [00:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417769 (10phaultfinder) [00:54:38] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [00:56:19] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105822 (owner: 10TrainBranchBot) [00:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417788 (10phaultfinder) [01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105457 (owner: 10TrainBranchBot) [01:08:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105823 [01:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105823 (owner: 10TrainBranchBot) [01:10:41] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [01:16:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:19:31] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:26:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105823 (owner: 10TrainBranchBot) [01:54:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/fa51acc24359871768bd5a9c292a0a8e7818f566160b2b2edfc0d00c2a0c0ca1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417821 (10phaultfinder) [02:09:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417823 (10phaultfinder) [02:14:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:16:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:18:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [02:19:31] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417830 (10phaultfinder) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:43] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [03:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [03:19:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417882 (10phaultfinder) [04:53:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [05:09:43] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [05:33:50] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@bcf6276]: (no justification provided) [05:37:56] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@bcf6276]: (no justification provided) (duration: 04m 11s) [05:49:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417976 (10phaultfinder) [06:04:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [06:10:43] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [06:19:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418001 (10phaultfinder) [06:36:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [06:39:43] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241220T0700) [07:01:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [07:08:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [07:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [07:09:43] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [07:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418044 (10phaultfinder) [07:19:39] (03PS1) 10Marostegui: mariadb: Add db2243 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1105832 (https://phabricator.wikimedia.org/T382425) [07:21:50] (03CR) 10Muehlenhoff: [C:03+2] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/1105271 (owner: 10Muehlenhoff) [07:23:48] (03CR) 10Marostegui: [C:03+2] mariadb: Add db2243 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1105832 (https://phabricator.wikimedia.org/T382425) (owner: 10Marostegui) [07:24:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10418061 (10Marostegui) Done [07:24:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10418062 (10Marostegui) a:05Marostegui→03None [07:24:55] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10418063 (10Marostegui) [07:29:39] (03PS1) 10Muehlenhoff: maps::master Add toggle for planet sync [puppet] - 10https://gerrit.wikimedia.org/r/1105836 (https://phabricator.wikimedia.org/T381565) [07:34:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105836 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:39:36] (03PS1) 10Muehlenhoff: postgresql::master: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105874 [07:39:56] (03CR) 10CI reject: [V:04-1] postgresql::master: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105874 (owner: 10Muehlenhoff) [07:42:14] (03PS2) 10Muehlenhoff: postgresql::master: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105874 [07:45:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105874 (owner: 10Muehlenhoff) [07:50:28] (03Abandoned) 10Muehlenhoff: maps::master Add toggle for planet sync [puppet] - 10https://gerrit.wikimedia.org/r/1105836 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:52:08] (03CR) 10Marostegui: [C:03+1] mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/1103353 (owner: 10Ladsgroup) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241220T0800) [08:00:39] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1105773 (https://phabricator.wikimedia.org/T381851) (owner: 10AOkoth) [08:03:50] (03PS1) 10Muehlenhoff: planet_sync: Cleanup time handling [puppet] - 10https://gerrit.wikimedia.org/r/1105875 (https://phabricator.wikimedia.org/T381565) [08:06:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105875 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418107 (10phaultfinder) [08:49:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418136 (10phaultfinder) [08:53:55] (03CR) 10Elukey: [C:03+1] planet_sync: Cleanup time handling [puppet] - 10https://gerrit.wikimedia.org/r/1105875 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:54:13] (03CR) 10Elukey: [C:03+1] postgresql::master: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105874 (owner: 10Muehlenhoff) [09:02:27] (03PS1) 10Muehlenhoff: planet_sync: Remove obsolete options [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) [09:04:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:06:03] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1001-1004].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:06:44] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1008-1011].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:07:48] (03PS1) 10Muehlenhoff: postgresql::slave: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105877 [09:08:03] (03PS1) 10Stevemunene: Make WikimediaCampaignEvents use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) [09:09:25] (03PS2) 10Stevemunene: Make WikimediaCampaignEvents use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) [09:10:03] (03CR) 10Elukey: [C:03+1] "LGTM! It feels a little strange at first sight to see an exception to signal a success state, but I agree that this is good and easy solut" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) (owner: 10Volans) [09:11:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105877 (owner: 10Muehlenhoff) [09:11:41] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [09:14:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418157 (10phaultfinder) [09:14:57] (03PS1) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T377956) [09:15:43] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [09:16:23] (03CR) 10Elukey: [C:03+1] "I like the solution and the "DONE (pass)" approach, it should solve al concerns expressed in the task. LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) (owner: 10Volans) [09:17:38] (03PS18) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [09:18:23] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [09:20:22] (03PS1) 10JMeybohm: sre.hosts.reimage: Extend --force help message [cookbooks] - 10https://gerrit.wikimedia.org/r/1105880 [09:20:59] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1008-1011].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:21:18] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1001-1004].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:22:04] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1001-1004].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:23:02] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1001.eqiad.wmnet with OS bookworm [09:23:03] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1008-1011].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:23:59] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1008.eqiad.wmnet with OS bookworm [09:37:43] (03PS1) 10Stevemunene: Add linkeddata.cultureelerfgoed.nl to SPARQL allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1105882 (https://phabricator.wikimedia.org/T381717) [09:39:49] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [09:41:12] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1008.eqiad.wmnet with reason: host reimage [09:42:09] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage [09:43:36] !log imported imposm3 0.11.1-1+deb12u1 to apt.wikimedia.org [09:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:53] !log imported imposm3 0.11.1-1+deb12u1 to apt.wikimedia.org T381565 [09:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:57] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [09:45:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1008.eqiad.wmnet with reason: host reimage [09:48:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage [10:02:43] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1105880 (owner: 10JMeybohm) [10:04:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1008.eqiad.wmnet with OS bookworm [10:05:54] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1009.eqiad.wmnet with OS bookworm [10:09:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1001.eqiad.wmnet with OS bookworm [10:11:01] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1002.eqiad.wmnet with OS bookworm [10:17:11] (03CR) 10Elukey: [C:03+1] postgresql::slave: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105877 (owner: 10Muehlenhoff) [10:18:02] (03CR) 10Elukey: [C:03+1] planet_sync: Remove obsolete options [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:19:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:53] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage [10:25:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage [10:27:18] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database idwikivoyage (T381079) [10:27:23] T381079: Prepare and check storage layer for idwikivoyage - https://phabricator.wikimedia.org/T381079 [10:27:24] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1002.eqiad.wmnet with reason: host reimage [10:27:29] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database idwikivoyage (T381079) [10:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418302 (10phaultfinder) [10:30:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1002.eqiad.wmnet with reason: host reimage [10:37:32] (03PS1) 10Btullis: Bump the mapreduce history heap to 4096 on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1105884 (https://phabricator.wikimedia.org/T382575) [10:38:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4725/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105884 (https://phabricator.wikimedia.org/T382575) (owner: 10Btullis) [10:38:27] (03CR) 10Btullis: Bump the mapreduce history heap to 4096 on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1105884 (https://phabricator.wikimedia.org/T382575) (owner: 10Btullis) [10:41:00] (03CR) 10Btullis: [C:03+1] Add linkeddata.cultureelerfgoed.nl to SPARQL allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1105882 (https://phabricator.wikimedia.org/T381717) (owner: 10Stevemunene) [10:45:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1009.eqiad.wmnet with OS bookworm [10:46:50] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1010.eqiad.wmnet with OS bookworm [10:49:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1002.eqiad.wmnet with OS bookworm [10:50:52] (03PS3) 10Fabfur: varnish: pass WME HEAD reqs to pass for ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) [10:51:30] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1003.eqiad.wmnet with OS bookworm [10:55:23] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [11:00:33] (03CR) 10Btullis: [C:03+2] Bump the mapreduce history heap to 4096 on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1105884 (https://phabricator.wikimedia.org/T382575) (owner: 10Btullis) [11:04:01] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage [11:07:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage [11:08:01] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1003.eqiad.wmnet with reason: host reimage [11:08:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [11:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:10:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1003.eqiad.wmnet with reason: host reimage [11:27:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1010.eqiad.wmnet with OS bookworm [11:28:52] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1011.eqiad.wmnet with OS bookworm [11:29:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1003.eqiad.wmnet with OS bookworm [11:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418434 (10phaultfinder) [11:31:19] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1004.eqiad.wmnet with OS bookworm [11:41:21] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:45:27] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage [11:47:44] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1004.eqiad.wmnet with reason: host reimage [11:48:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage [11:52:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1004.eqiad.wmnet with reason: host reimage [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241220T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241220T1200). nyaa~ [12:02:33] (03CR) 10Urbanecm: "LGTM, pending deployment of Id70d05b05ebd5d8a1650208b28b435da1f89d49e (first train of 2025)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [12:05:30] (03CR) 10Kamila Součková: [C:03+1] sre.hosts.reimage: Extend --force help message [cookbooks] - 10https://gerrit.wikimedia.org/r/1105880 (owner: 10JMeybohm) [12:08:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1011.eqiad.wmnet with OS bookworm [12:08:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1008-1011].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:10:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1004.eqiad.wmnet with OS bookworm [12:10:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1001-1004].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:15:03] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:15:36] (03CR) 10DCausse: team-search-platform: Add alert for wdqs-categories lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [12:26:15] RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 71%, RTA = 30.32 ms [12:28:53] (03PS2) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) [12:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418537 (10phaultfinder) [12:40:18] (03CR) 10Kamila Součková: create sre.k8s.roll-reimage-nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [12:40:48] (03PS19) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [12:44:11] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:46:03] (03CR) 10CI reject: [V:04-1] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [13:00:11] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:03:03] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:08:15] (03PS1) 10Btullis: Double the maximum number of files in an HDFS directory [puppet] - 10https://gerrit.wikimedia.org/r/1105893 (https://phabricator.wikimedia.org/T380674) [13:09:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4726/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105893 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [13:09:51] (03PS5) 10Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) [13:12:25] (03PS20) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [13:12:59] (03CR) 10Stevemunene: [C:03+1] "Looks good, Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1105893 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [13:20:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:03] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:43:18] (03CR) 10Elukey: "Fixed some label names, followed what done in https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/556250." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: 10Elukey) [13:43:21] (03CR) 10Btullis: [V:03+1 C:03+2] Double the maximum number of files in an HDFS directory [puppet] - 10https://gerrit.wikimedia.org/r/1105893 (https://phabricator.wikimedia.org/T380674) (owner: 10Btullis) [13:48:22] (03CR) 10Filippo Giunchedi: prometheus: deploy instances from a single configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:50:50] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10418717 (10cmooney) p:05Triage→03Low [14:06:11] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [14:15:41] 06SRE, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10418777 (10BTullis) >>! In T368098#10392790, @Marostegui wrote: > @BTullis do you think you could find some time to explore this idea. Yes, I think that this i... [14:19:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:57] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10418786 (10Jhancock.wm) @Andrew so I thought i fixed it. turns out its not working. I was having the same issue after running reimage. After it ran once, i... [14:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418797 (10phaultfinder) [14:25:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:44] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [14:43:48] (03PS2) 10Volans: api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) [14:43:48] (03PS2) 10Volans: api: allow to skip the START log to SAL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) [14:47:45] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:48:35] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Swift [14:49:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10418831 (10phaultfinder) [15:01:48] (03PS1) 10DCausse: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) [15:02:46] (03CR) 10DCausse: [C:04-1] "needs to deploy 0.10.0 to the docker registry first" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [15:06:09] (03PS1) 10Herron: pyrra: remove liftwing slos [puppet] - 10https://gerrit.wikimedia.org/r/1105921 (https://phabricator.wikimedia.org/T368953) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:49] (03CR) 10Ottomata: [C:03+1] "Hm, you might want to make this change in eventstreams-internal too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [15:07:02] (03CR) 10Ottomata: [C:03+1] "Feel free to deploy when you are ready." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [15:16:13] (03PS3) 10Herron: pyrra: remove liftwing slos [puppet] - 10https://gerrit.wikimedia.org/r/1105921 (https://phabricator.wikimedia.org/T368953) [15:17:33] (03CR) 10DCausse: [C:04-1] "ah good point, better safe than sorry, will configure them there." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [15:17:36] (03PS2) 10DCausse: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) [15:17:44] (03CR) 10Herron: [C:03+2] pyrra: remove liftwing slos [puppet] - 10https://gerrit.wikimedia.org/r/1105921 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [15:18:26] (03CR) 10Bking: team-search-platform: Add alert for wdqs-categories lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [15:21:37] (03CR) 10DCausse: team-search-platform: Add alert for wdqs-categories lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [15:23:51] (03CR) 10JHathaway: [C:03+1] postgresql::master: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105874 (owner: 10Muehlenhoff) [15:24:04] (03CR) 10JHathaway: [C:03+1] postgresql::slave: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1105877 (owner: 10Muehlenhoff) [15:37:01] (03CR) 10Ottomata: "You do now! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [15:39:44] (03CR) 10DCausse: "ah thanks! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [16:04:24] (03CR) 10Ahmon Dancy: [C:03+1] dockerpkg-builder: add to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: 10Brennen Bearnes) [16:04:47] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [16:12:43] (03CR) 10JMeybohm: [C:03+2] sre.hosts.reimage: Extend --force help message [cookbooks] - 10https://gerrit.wikimedia.org/r/1105880 (owner: 10JMeybohm) [16:16:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10419000 (10Eevans) So that's a full week and everything looks fine (RAID still intact, no dmesg errors). I'll optimistically close this for now. 🤞 [16:16:47] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10419001 (10Eevans) 05Open→03Resolved [16:18:00] (03Merged) 10jenkins-bot: sre.hosts.reimage: Extend --force help message [cookbooks] - 10https://gerrit.wikimedia.org/r/1105880 (owner: 10JMeybohm) [16:19:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419010 (10phaultfinder) [16:47:22] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@8c5744d]: Deploying latest analytics Airflow instance DAGs. T377852. [16:47:26] T377852: Tune Reconciliation mechanism to do historic runs (all revisions, all wikis) - https://phabricator.wikimedia.org/T377852 [16:48:20] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@8c5744d]: Deploying latest analytics Airflow instance DAGs. T377852. (duration: 00m 58s) [17:06:46] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1023-1024].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:08:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1023.eqiad.wmnet with OS bookworm [17:09:46] 06SRE, 06cloud-services-team, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#10419111 (10fnegri) [17:20:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:53] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1023.eqiad.wmnet with reason: host reimage [17:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419163 (10phaultfinder) [17:31:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1023.eqiad.wmnet with reason: host reimage [17:34:48] PROBLEM - MariaDB Replica SQL: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: huwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:35:09] o/ [17:35:11] lovely [17:35:16] !incidents [17:35:17] 5548 (UNACKED) db2168 (paged)/MariaDB Replica SQL: s7 (paged) [17:35:21] swfrench-wmf: this is one of those cases Amir mentioned, it's in the doc [17:35:23] I'll fix that [17:35:27] !ack 5548 [17:35:27] 5548 (ACKED) db2168 (paged)/MariaDB Replica SQL: s7 (paged) [17:35:39] cdanis: yes, indeed - thanks! [17:35:56] marostegui: I'm happy to as "practice" unless you're already on it [17:36:46] swfrench-wmf: I wanted to upgrade mariadb too to the version that avoids those [17:36:51] But you can go ahead [17:36:56] And then I can upgrade [17:37:41] marostegui: sounds good, I'll give it a try now and ping you when I'm done [17:38:15] ok [17:38:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2168.codfw.wmnet with reason: maintenance [17:38:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2168.codfw.wmnet with reason: maintenance [17:39:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2168', diff saved to https://phabricator.wikimedia.org/P71735 and previous config saved to /var/cache/conftool/dbconfig/20241220-173922-marostegui.json [17:39:48] RECOVERY - MariaDB Replica SQL: s7 #page on db2168 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:39:55] 🎉 [17:39:56] marostegui: done [17:40:07] swfrench-wmf: thanks, restarting [17:41:40] done [17:41:47] \o/ [17:41:51] thank you! [17:41:52] swfrench-wmf: would you update the doc? I will take care of repooling the host [17:42:17] Doh, I just got that triggered and resolved simultaneously [17:42:20] ack, will do [17:42:37] thank you! [17:42:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71736 and previous config saved to /var/cache/conftool/dbconfig/20241220-174246-root.json [17:43:47] marostegui: I'll add a note in the comment column that you've updated it to 10.6.20? [17:44:05] swfrench-wmf: sounds good thank you! [17:47:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:50:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:32] (03CR) 10AOkoth: [C:03+2] admin: Add ammarpad to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1105773 (https://phabricator.wikimedia.org/T381851) (owner: 10AOkoth) [17:51:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:52:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1023.eqiad.wmnet with OS bookworm [17:53:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1024.eqiad.wmnet with OS bookworm [17:54:09] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10419238 (10Arnoldokoth) @Ammarpad This is good to go. Kindly test and feel free to resolve if everything works fine. [17:54:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10419241 (10Arnoldokoth) [17:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71737 and previous config saved to /var/cache/conftool/dbconfig/20241220-175751-root.json [18:00:39] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@7fecc64]: Pickup hotfix for T377852. [18:00:43] T377852: Tune Reconciliation mechanism to do historic runs (all revisions, all wikis) - https://phabricator.wikimedia.org/T377852 [18:02:42] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@7fecc64]: Pickup hotfix for T377852. (duration: 02m 03s) [18:06:16] (03CR) 10CDanis: [C:03+1] "LGTM after holidays" [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:06:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:07:01] (03CR) 10CDanis: [C:03+1] "lgtm after holidays!" [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [18:12:32] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1024.eqiad.wmnet with reason: host reimage [18:12:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71738 and previous config saved to /var/cache/conftool/dbconfig/20241220-181256-root.json [18:15:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1024.eqiad.wmnet with reason: host reimage [18:19:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71739 and previous config saved to /var/cache/conftool/dbconfig/20241220-182801-root.json [18:35:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1024.eqiad.wmnet with OS bookworm [18:35:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1023-1024].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [18:43:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71741 and previous config saved to /var/cache/conftool/dbconfig/20241220-184307-root.json [18:44:39] PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:45:02] !incidents [18:45:03] 5549 (UNACKED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [18:45:03] 5548 (RESOLVED) db2168 (paged)/MariaDB Replica SQL: s7 (paged) [18:45:10] !ack 5549 [18:45:10] 5549 (ACKED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [18:45:17] so, this is presumably dumps [18:46:42] https://phabricator.wikimedia.org/T368098 ? [18:46:44] this one I got [18:47:07] I think so yes :( [18:47:37] confirmed that this is dumps / vslow and nominal weight 1 for everything else [18:52:29] would it make sense to set the section weight to 0 (avoid any interactive traffic) while leaving the "slow things" group weights non-zero, and potentially downtime it? [18:52:57] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1099312 (owner: 10Ncmonitor) [18:53:08] swfrench-wmf: if you set it to 0 MW will not check for its lag so it's better to leave 1 [18:54:46] marostegui: oh, interesting! I didn't realize 0 would have additional effects beyond "don't send queries there" - very good to know [18:59:33] so, I suppose we don't really have any recourse here, other than stop dumps, which we probably do not want to do unless they cause wider disruption [18:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419365 (10phaultfinder) [19:01:40] swfrench-wmf: yeh, there's some discussion to move them to some other servers [19:04:31] marostegui: ah, I see - I didn't realize that might be happening in parallel with the k8s migration. [19:10:00] (03PS1) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 [19:10:20] (03CR) 10CI reject: [V:04-1] dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott) [19:15:55] (03PS2) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 [19:15:56] (03PS1) 10Andrew Bogott: cloud-vps: increase # of attempts with dns resolving [puppet] - 10https://gerrit.wikimedia.org/r/1105945 (https://phabricator.wikimedia.org/T374830) [19:16:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105945 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [19:18:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott) [19:22:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10419425 (10LSobanski) 05Open→03Resolved It's been two weeks since the fix was deployed. Please reopen this task if you notice any furth... [19:24:42] (03PS3) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 [19:25:23] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott) [19:31:12] (03PS4) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 [19:31:12] (03PS2) 10Andrew Bogott: cloud-vps: increase # of attempts with dns resolving [puppet] - 10https://gerrit.wikimedia.org/r/1105945 (https://phabricator.wikimedia.org/T374830) [19:33:41] RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 59.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419562 (10phaultfinder) [20:30:04] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1313765128 and 68 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:31:13] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616 (10MSantos) 03NEW [20:35:04] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:53:35] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10419601 (10Andrew) I will have another go! [21:13:38] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [21:19:41] PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:19:53] !incidents [21:19:54] 5550 (UNACKED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [21:19:54] 5549 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [21:19:54] 5548 (RESOLVED) db2168 (paged)/MariaDB Replica SQL: s7 (paged) [21:19:58] !ack 5550 [21:19:58] 5550 (ACKED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [21:21:55] 06SRE, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10419702 (10Marostegui) Thanks @btullis. These really needs some priority as it keeps paging the on-call sre. Today we got two pages for replication lag. [21:24:59] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [21:26:15] FYI, since there's nothing we can do about this, I am going to look into scheduling a downtime scoped to this specific check. [21:29:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [21:29:41] RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 58.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:35:07] 06SRE, 06Data-Platform, 06DBA, 10Dumps 2.0, 10Dumps-Generation: Repeated replication lag pages for db1206 - https://phabricator.wikimedia.org/T382625 (10Scott_French) 03NEW [21:46:40] 06SRE, 06Data-Platform, 06DBA, 10Dumps 2.0, 10Dumps-Generation: Repeated replication lag pages for db1206 - https://phabricator.wikimedia.org/T382625#10419762 (10Scott_French) I've downtimed the service through 8:00 UTC on Monday 12/23. FYI @akosiaris and @MoritzMuehlenhoff as next business-hours rotati... [21:48:55] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [21:52:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [21:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419769 (10phaultfinder) [22:12:10] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [22:17:42] (03PS1) 10MSantos: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) [22:19:32] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:16] RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 60%, RTA = 30.37 ms [22:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419832 (10phaultfinder) [23:19:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10419845 (10phaultfinder)