[00:00:03] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41057 and previous config saved to /var/cache/conftool/dbconfig/20221125-000013-marostegui.json
[00:04:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[00:04:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[00:04:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P41058 and previous config saved to /var/cache/conftool/dbconfig/20221125-000421-ladsgroup.json
[00:04:27] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[00:06:11] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P41059 and previous config saved to /var/cache/conftool/dbconfig/20221125-000614-ladsgroup.json
[00:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P41060 and previous config saved to /var/cache/conftool/dbconfig/20221125-000630-ladsgroup.json
[00:15:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P41061 and previous config saved to
/var/cache/conftool/dbconfig/20221125-001520-marostegui.json
[00:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P41062 and previous config saved to /var/cache/conftool/dbconfig/20221125-002119-ladsgroup.json
[00:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P41063 and previous config saved to /var/cache/conftool/dbconfig/20221125-002137-ladsgroup.json
[00:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P41064 and previous config saved to /var/cache/conftool/dbconfig/20221125-003026-marostegui.json
[00:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P41065 and previous config saved to /var/cache/conftool/dbconfig/20221125-003643-ladsgroup.json
[00:45:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41066 and previous config saved to /var/cache/conftool/dbconfig/20221125-004533-marostegui.json
[00:45:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:45:40] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[00:45:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:45:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T321126)', diff saved to https://phabricator.wikimedia.org/P41067 and previous config saved to /var/cache/conftool/dbconfig/20221125-004554-marostegui.json
[00:48:05] !log marostegui@cumin1001 dbctl commit
(dc=all): 'Repooling after maintenance db2181 (T321126)', diff saved to https://phabricator.wikimedia.org/P41068 and previous config saved to /var/cache/conftool/dbconfig/20221125-004805-marostegui.json
[00:51:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P41069 and previous config saved to /var/cache/conftool/dbconfig/20221125-005150-ladsgroup.json
[00:51:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[01:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P41070 and previous config saved to /var/cache/conftool/dbconfig/20221125-010311-marostegui.json
[01:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P41071 and previous config saved to /var/cache/conftool/dbconfig/20221125-011818-marostegui.json
[01:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T321126)', diff saved to https://phabricator.wikimedia.org/P41072 and previous config saved to /var/cache/conftool/dbconfig/20221125-013324-marostegui.json
[01:33:31] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:07] PROBLEM - SSH on mw1331.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:23] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[02:32:29] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[04:01:12] (PS1) KartikMistry: Content Translation: Reverse MT threshold for Japanese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/860701 (https://phabricator.wikimedia.org/T323721)
[04:49:43] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:06:11] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used.
Largest process: qemu-system-x86 (30718) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:34:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:34:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:44:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[05:44:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[05:45:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[05:46:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[05:51:29] (PS1) Marostegui: control-mariadb-10.5: Remove file [software] - https://gerrit.wikimedia.org/r/860702
[05:52:36] (CR) Marostegui: [C: +2] control-mariadb-10.5: Remove file [software] - https://gerrit.wikimedia.org/r/860702 (owner: Marostegui)
[05:54:02] (Merged) jenkins-bot: control-mariadb-10.5: Remove file [software] - https://gerrit.wikimedia.org/r/860702 (owner: Marostegui)
[05:54:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[05:54:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[05:54:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:55:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:55:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T321126)', diff saved to https://phabricator.wikimedia.org/P41073 and previous config saved to /var/cache/conftool/dbconfig/20221125-055517-marostegui.json
[05:55:23] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:02:12] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add pip to python3-bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/844319 (owner: Giuseppe Lavagetto)
[06:05:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321126)', diff saved to https://phabricator.wikimedia.org/P41074 and previous config saved to /var/cache/conftool/dbconfig/20221125-060530-marostegui.json
[06:05:37] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:06:42] (PS1) Giuseppe Lavagetto: Remove the parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/860703
[06:13:28] (PS1) Giuseppe Lavagetto: push-notifications: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860704
[06:13:53] (PS1) Giuseppe Lavagetto: shellbox: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860705
[06:14:25] (PS1) Giuseppe Lavagetto: similar-users: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860706
[06:14:39] (PS1) Giuseppe Lavagetto: tegola-vector-tiles: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860707
[06:14:49] (PS1) Giuseppe Lavagetto: termbox: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860708
[06:15:04] (PS1) Giuseppe Lavagetto: toolhub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860709
[06:16:54] (PS1) Giuseppe Lavagetto: zotero: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860710
[06:17:01] (PS1) Giuseppe Lavagetto: thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711
[06:17:47] (CR) CI reject: [V: -1] thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711 (owner: Giuseppe Lavagetto)
[06:19:00] (CR) CI reject: [V: -1] tegola-vector-tiles: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860707 (owner: Giuseppe Lavagetto)
[06:19:11] RECOVERY - SSH on mw1331.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:20:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41075 and previous config saved to /var/cache/conftool/dbconfig/20221125-062036-marostegui.json
[06:21:49] PROBLEM - SSH on wdqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:30:39] (PS1) Giuseppe Lavagetto: mediawiki: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860712
[06:32:08] (PS2) Giuseppe Lavagetto: mediawiki: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860712
[06:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41076 and previous config saved to /var/cache/conftool/dbconfig/20221125-063543-marostegui.json
[06:39:32] (PS1) Giuseppe Lavagetto: mediawiki-dev: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860713
[06:43:19] (PS1) Giuseppe Lavagetto: secrets: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860714
[06:48:43] (PS1) Giuseppe Lavagetto: knative-serving: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860715
[06:48:58] (PS1) Giuseppe Lavagetto: knative-serving-crds: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860716
[06:49:20] (Abandoned) Giuseppe Lavagetto: knative-serving-crds: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860716 (owner: Giuseppe Lavagetto)
[06:49:59] (CR) CI reject: [V: -1] knative-serving: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860715 (owner: Giuseppe Lavagetto)
[06:50:03] (PS1) Giuseppe Lavagetto: kserve-inference: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860717
[06:50:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321126)', diff saved to https://phabricator.wikimedia.org/P41077 and previous config saved to /var/cache/conftool/dbconfig/20221125-065049-marostegui.json
[06:50:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:50:56] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:51:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:59:10] !log marostegui@cumin1001 START -
Cookbook sre.hosts.downtime for 5:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:59:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T321126)', diff saved to https://phabricator.wikimedia.org/P41078 and previous config saved to /var/cache/conftool/dbconfig/20221125-065930-marostegui.json
[06:59:36] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321126)', diff saved to https://phabricator.wikimedia.org/P41079 and previous config saved to /var/cache/conftool/dbconfig/20221125-070940-marostegui.json
[07:09:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:13:55] (CR) Slyngshede: [V: +2] Allow configuration from json file. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860508 (owner: Slyngshede)
[07:16:11] (CR) Slyngshede: [V: +2 C: +2] Allow configuration from json file.
[software/bitu-ldap] - https://gerrit.wikimedia.org/r/860508 (owner: Slyngshede)
[07:19:23] (PS2) Giuseppe Lavagetto: thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711
[07:22:43] RECOVERY - SSH on wdqs1008.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41080 and previous config saved to /var/cache/conftool/dbconfig/20221125-072447-marostegui.json
[07:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41081 and previous config saved to /var/cache/conftool/dbconfig/20221125-073953-marostegui.json
[07:44:11] (PS1) Marostegui: pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/860820
[07:45:25] (CR) Marostegui: [C: +2] pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/860820 (owner: Marostegui)
[07:54:12] (PS1) Marostegui: Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/860497
[07:55:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321126)', diff saved to https://phabricator.wikimedia.org/P41082 and previous config saved to /var/cache/conftool/dbconfig/20221125-075500-marostegui.json
[07:55:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:55:08] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:55:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:55:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling
db1166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41083 and previous config saved to /var/cache/conftool/dbconfig/20221125-075521-marostegui.json
[07:55:36] (CR) Marostegui: [C: +2] Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/860497 (owner: Marostegui)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221125T0800)
[08:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:04:35] (PS2) Giuseppe Lavagetto: knative-serving: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860715
[08:05:19] (CR) CI reject: [V: -1] knative-serving: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860715 (owner: Giuseppe Lavagetto)
[08:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41084 and previous config saved to /var/cache/conftool/dbconfig/20221125-080521-marostegui.json
[08:05:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:15] (PS3) Giuseppe Lavagetto: knative-serving: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860715
[08:09:16] !log rebalance Ganeti group C/codfw following reboots
[08:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:50] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for Envoy on debmonitor [puppet] - https://gerrit.wikimedia.org/r/860579 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:20:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41085 and previous config saved to /var/cache/conftool/dbconfig/20221125-082027-marostegui.json
[08:28:08] (PS2) Giuseppe Lavagetto: tegola-vector-tiles: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860707
[08:29:03] (CR) Filippo Giunchedi: o11y: more lenient logstash kafka consumer lag (1 comment) [alerts] - https://gerrit.wikimedia.org/r/860609 (owner: Filippo Giunchedi)
[08:29:23] (PS1) Giuseppe Lavagetto: flink-session-cluster: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860828
[08:29:57] (PS1) Giuseppe Lavagetto: function-evaluator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860829
[08:30:07] (PS1) Giuseppe Lavagetto: function-orchestrator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860830
[08:30:09] (CR) CI reject: [V: -1] flink-session-cluster: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860828 (owner: Giuseppe Lavagetto)
[08:33:25] (CR) Muehlenhoff: [C: +2] Set role contacts for webperf* roles to o11y [puppet] - https://gerrit.wikimedia.org/r/858294 (owner: Muehlenhoff)
[08:35:26] !log installing libarchive security updates
[08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to
https://phabricator.wikimedia.org/P41086 and previous config saved to /var/cache/conftool/dbconfig/20221125-083534-marostegui.json
[08:43:32] (PS32) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[08:44:31] (CR) David Caro: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: Raymond Ndibe)
[08:45:25] (PS1) Jcrespo: Fix parameter naming on deletion of an S3 object [software/mediabackups] - https://gerrit.wikimedia.org/r/860831 (https://phabricator.wikimedia.org/T323796)
[08:46:05] (CR) CI reject: [V: -1] Fix parameter naming on deletion of an S3 object [software/mediabackups] - https://gerrit.wikimedia.org/r/860831 (https://phabricator.wikimedia.org/T323796) (owner: Jcrespo)
[08:47:30] (PS33) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[08:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41087 and previous config saved to /var/cache/conftool/dbconfig/20221125-085040-marostegui.json
[08:50:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[08:50:47] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:50:47] (PS1) Jcrespo: Prepare for release 0.1.4 [software/mediabackups] - https://gerrit.wikimedia.org/r/860832 (https://phabricator.wikimedia.org/T323796)
[08:50:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for
5:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[08:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T321126)', diff saved to https://phabricator.wikimedia.org/P41088 and previous config saved to /var/cache/conftool/dbconfig/20221125-085101-marostegui.json
[08:51:04] SRE, Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (fgiunchedi)
[08:51:28] (CR) CI reject: [V: -1] Prepare for release 0.1.4 [software/mediabackups] - https://gerrit.wikimedia.org/r/860832 (https://phabricator.wikimedia.org/T323796) (owner: Jcrespo)
[08:58:59] <_joe_> uhm why autoop doesn't work
[09:00:30] <_joe_> ok that looks better
[09:00:49] (PS2) Jcrespo: Fix parameter naming on deletion of an S3 object [software/mediabackups] - https://gerrit.wikimedia.org/r/860831 (https://phabricator.wikimedia.org/T323796)
[09:00:51] (PS2) Jcrespo: Prepare for release 0.1.4 [software/mediabackups] - https://gerrit.wikimedia.org/r/860832 (https://phabricator.wikimedia.org/T323796)
[09:00:53] (PS1) Jcrespo: Fix minor syntax issue while rising an exception [software/mediabackups] - https://gerrit.wikimedia.org/r/860833
[09:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321126)', diff saved to https://phabricator.wikimedia.org/P41089 and previous config saved to /var/cache/conftool/dbconfig/20221125-090102-marostegui.json
[09:01:09] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[09:01:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:06:55] (LogstashKafkaConsumerLag)
resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:11:36] (CR) Jcrespo: [C: +2] Fix minor syntax issue while rising an exception [software/mediabackups] - https://gerrit.wikimedia.org/r/860833 (owner: Jcrespo)
[09:11:44] (CR) Jcrespo: [C: +2] Fix parameter naming on deletion of an S3 object [software/mediabackups] - https://gerrit.wikimedia.org/r/860831 (https://phabricator.wikimedia.org/T323796) (owner: Jcrespo)
[09:11:51] (CR) Jcrespo: [C: +2] Prepare for release 0.1.4 [software/mediabackups] - https://gerrit.wikimedia.org/r/860832 (https://phabricator.wikimedia.org/T323796) (owner: Jcrespo)
[09:15:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:16:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41090 and previous config saved to /var/cache/conftool/dbconfig/20221125-091609-marostegui.json
[09:28:28] (PS1) Muehlenhoff: Make ganeti2031 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/860834
[09:31:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41091 and previous
config saved to /var/cache/conftool/dbconfig/20221125-093115-marostegui.json
[09:34:43] (PS1) Jaime Nuche: create group for Release Engineering members [puppet] - https://gerrit.wikimedia.org/r/860836
[09:34:45] (PS1) Jaime Nuche: jenkins: add RelEng deploy user for Jenkins Scap3 deployments [puppet] - https://gerrit.wikimedia.org/r/860837
[09:37:59] (PS2) Giuseppe Lavagetto: flink-session-cluster: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860828
[09:38:09] (CR) David Caro: "Doing a dummy test on dbusers-nfs-1.testlabs.eqiad1.wikimedia.cloud, after cherry-picking on the puppetmaster and running puppet until it " [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:40:11] (CR) Muehlenhoff: [C: +2] Make ganeti2031 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/860834 (owner: Muehlenhoff)
[09:44:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:45] (PS1) Jcrespo: mediabackups: Add new policy intended for admin deletion of files [puppet] - https://gerrit.wikimedia.org/r/860838 (https://phabricator.wikimedia.org/T323796)
[09:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321126)', diff saved to https://phabricator.wikimedia.org/P41092 and previous config saved to /var/cache/conftool/dbconfig/20221125-094622-marostegui.json
[09:46:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:46:29] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[09:46:37]
!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41093 and previous config saved to /var/cache/conftool/dbconfig/20221125-094643-marostegui.json
[09:49:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:57:27] (PS2) David Caro: harbor: remove support for (PS2) David Caro: harbor: remove unused harbor::db module/role [puppet] - https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616)
[09:57:31] (PS4) David Caro: toolforge harbor: update certs with acmechief [puppet] - https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: Bstorm)
[09:57:33] (CR) David Caro: toolforge harbor: update certs with acmechief (1 comment) [puppet] - https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: Bstorm)
[10:04:42] (CR) Vgutierrez: setup.py: update dependencies for bullseye (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[10:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41094 and previous config saved to /var/cache/conftool/dbconfig/20221125-100456-marostegui.json
[10:05:03] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[10:10:58] (PS5) David Caro: toolforge harbor: update
certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [10:12:54] (03PS6) 10David Caro: toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [10:14:32] (03PS1) 10Elukey: turnilo: add new time to first byte measure [puppet] - 10https://gerrit.wikimedia.org/r/860843 [10:16:47] (03PS7) 10David Caro: toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [10:19:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41095 and previous config saved to /var/cache/conftool/dbconfig/20221125-102002-marostegui.json [10:21:19] (03PS1) 10Filippo Giunchedi: New upstream release [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 [10:22:45] (03CR) 10David Caro: "Now it's ready :)" [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [10:31:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:25] (03PS1) 10Jcrespo: Check 1 row was affected after metadata deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/860850 (https://phabricator.wikimedia.org/T323796) [10:35:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41096 and previous config saved to 
/var/cache/conftool/dbconfig/20221125-103509-marostegui.json [10:36:34] (03PS2) 10Filippo Giunchedi: New upstream release [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) [10:41:33] (03CR) 10Muehlenhoff: New upstream release (031 comment) [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) (owner: 10Filippo Giunchedi) [10:41:51] (03Abandoned) 10Jcrespo: Check 1 row was affected after metadata deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/860850 (https://phabricator.wikimedia.org/T323796) (owner: 10Jcrespo) [10:46:19] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [10:46:40] (03PS14) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [10:47:00] (03PS3) 10Filippo Giunchedi: New upstream release [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) [10:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41097 and previous config saved to /var/cache/conftool/dbconfig/20221125-105015-marostegui.json [10:50:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:50:22] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:50:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T321126)', 
diff saved to https://phabricator.wikimedia.org/P41098 and previous config saved to /var/cache/conftool/dbconfig/20221125-105036-marostegui.json [10:57:25] (03CR) 10Stevemunene: [C: 03+1] turnilo: add new time to first byte measure [puppet] - 10https://gerrit.wikimedia.org/r/860843 (owner: 10Elukey) [10:58:57] (03CR) 10Vgutierrez: "recheck" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:01:55] (03PS1) 10Jcrespo: deletion: Fix bug in query for metadata deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/860853 (https://phabricator.wikimedia.org/T323796) [11:03:03] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/860843 (owner: 10Elukey) [11:04:07] RECOVERY - Ganeti memory on ganeti1011 is OK: OK Memory 82% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:05:03] (03PS1) 10Muehlenhoff: Set role_contacts for ml-cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/860854 [11:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321126)', diff saved to https://phabricator.wikimedia.org/P41099 and previous config saved to /var/cache/conftool/dbconfig/20221125-110642-marostegui.json [11:06:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:07:32] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it!" [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) (owner: 10Filippo Giunchedi) [11:10:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the quick review!" [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) (owner: 10Filippo Giunchedi) [11:12:14] (03PS1) 10Slyngshede: Allow multiple server connections to be defined. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 [11:14:41] (03CR) 10Jcrespo: [C: 03+2] deletion: Fix bug in query for metadata deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/860853 (https://phabricator.wikimedia.org/T323796) (owner: 10Jcrespo) [11:15:52] (03PS2) 10Slyngshede: Allow multiple server connections to be defined. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 [11:17:42] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) Yes, I think I agree; although I'm not sure if Riccardo manage... [11:20:53] (03PS3) 10Vgutierrez: setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41100 and previous config saved to /var/cache/conftool/dbconfig/20221125-112148-marostegui.json [11:22:34] (03CR) 10Elukey: [C: 03+2] turnilo: add new time to first byte measure [puppet] - 10https://gerrit.wikimedia.org/r/860843 (owner: 10Elukey) [11:24:51] !log restart turnilo on an-tool1007 to pick up new settings for webrequest_sampled_live [11:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [11:35:06] (03PS4) 10Vgutierrez: setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41101 and 
previous config saved to /var/cache/conftool/dbconfig/20221125-113654-marostegui.json [11:38:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [11:39:58] (03PS15) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [11:47:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [11:49:16] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) [11:52:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321126)', diff saved to https://phabricator.wikimedia.org/P41102 and previous config saved to /var/cache/conftool/dbconfig/20221125-115201-marostegui.json [11:52:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:52:08] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:52:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:52:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T321126)', diff saved to https://phabricator.wikimedia.org/P41103 and previous config saved to /var/cache/conftool/dbconfig/20221125-115222-marostegui.json [12:05:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321126)', diff saved to https://phabricator.wikimedia.org/P41104 and previous config saved to /var/cache/conftool/dbconfig/20221125-120527-marostegui.json [12:05:34] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 
[12:08:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2031.codfw.wmnet to cluster codfw and group B [12:13:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2031.codfw.wmnet to cluster codfw and group B [12:13:46] (03PS1) 10Jaime Nuche: keyholder: add new identity for deploy-releng user [labs/private] - 10https://gerrit.wikimedia.org/r/860865 [12:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41105 and previous config saved to /var/cache/conftool/dbconfig/20221125-122033-marostegui.json [12:26:04] 10SRE, 10Infrastructure-Foundations, 10LDAP: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (10MoritzMuehlenhoff) [12:26:53] !log installing vim security updates [12:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:45] (03PS3) 10Kosta Harlan: [WIP] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [12:28:35] (03PS4) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [12:28:39] (03PS5) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [12:32:34] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:20] (03PS1) 10Kosta Harlan: GrowthExperiments: Start newimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [12:33:39] (03PS1) 10Muehlenhoff: buster tracking updates 
[puppet] - 10https://gerrit.wikimedia.org/r/860868 [12:35:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41106 and previous config saved to /var/cache/conftool/dbconfig/20221125-123540-marostegui.json [12:37:25] (03CR) 10Hashar: keyholder: add new identity for deploy-releng user (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/860865 (owner: 10Jaime Nuche) [12:45:37] (03CR) 10Muehlenhoff: [C: 03+2] buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/860868 (owner: 10Muehlenhoff) [12:47:56] (03PS1) 10Muehlenhoff: Make ganeti2032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/860873 (https://phabricator.wikimedia.org/T313856) [12:50:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321126)', diff saved to https://phabricator.wikimedia.org/P41107 and previous config saved to /var/cache/conftool/dbconfig/20221125-125046-marostegui.json [12:50:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:50:53] (03PS1) 10Jbond: environments: add environments file [puppet] - 10https://gerrit.wikimedia.org/r/860874 [12:50:53] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:51:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:53:49] (03PS2) 10Jaime Nuche: keyholder: add new identity for deploy-releng user [labs/private] - 10https://gerrit.wikimedia.org/r/860865 [12:55:05] (03CR) 10Jaime Nuche: keyholder: add new identity for deploy-releng user (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/860865 (owner: 10Jaime Nuche) [12:56:00] (03CR) 10Hashar: [V: 03+2 C: 03+2] keyholder: add new identity for deploy-releng user [labs/private] - 
10https://gerrit.wikimedia.org/r/860865 (owner: 10Jaime Nuche) [12:59:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:59:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T321126)', diff saved to https://phabricator.wikimedia.org/P41108 and previous config saved to /var/cache/conftool/dbconfig/20221125-125935-marostegui.json [12:59:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:02:16] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:49] (03CR) 10Jaime Nuche: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/860837 (owner: 10Jaime Nuche) [13:11:42] !log re-enabling puppet on wcqs1001 - data transfer completed - T321605 [13:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:48] T321605: Make WCQS/WDQS data transfer cookbook more reliable - https://phabricator.wikimedia.org/T321605 [13:11:49] inflatador: fyi ^ [13:11:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:16:02] (03PS1) 10Muehlenhoff: Update lvs-canary [puppet] - 10https://gerrit.wikimedia.org/r/860878 [13:18:14] RECOVERY - puppet last run on wcqs1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:24:21] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - 
https://phabricator.wikimedia.org/T321783 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321126)', diff saved to https://phabricator.wikimedia.org/P41109 and previous config saved to /var/cache/conftool/dbconfig/20221125-132853-marostegui.json [13:29:00] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:30:09] (03PS16) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:30:37] (03PS17) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:39:27] (03PS1) 10Hashar: Document how to test a JavaScript Gerrit plugin [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860885 (https://phabricator.wikimedia.org/T214068) [13:42:00] (03PS1) 10Muehlenhoff: Add role_contacts for mwlog [puppet] - 10https://gerrit.wikimedia.org/r/860886 [13:44:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41110 and previous config saved to /var/cache/conftool/dbconfig/20221125-134359-marostegui.json [13:44:38] (03PS18) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:46:37] (03CR) 10LMata: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/858294 (owner: 10Muehlenhoff) [13:48:00] (03CR) 10CI reject: [V: 04-1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 (owner: 10Arturo Borrero Gonzalez) [13:48:59] (03CR) 10Filippo Giunchedi: [C: 03+1] Add role_contacts for mwlog [puppet] - 
10https://gerrit.wikimedia.org/r/860886 (owner: 10Muehlenhoff) [13:50:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:50:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:59:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41111 and previous config saved to /var/cache/conftool/dbconfig/20221125-135906-marostegui.json [14:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321126)', diff saved to https://phabricator.wikimedia.org/P41112 and previous config saved to /var/cache/conftool/dbconfig/20221125-141412-marostegui.json [14:14:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:14:19] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:14:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T321126)', diff saved to https://phabricator.wikimedia.org/P41113 and previous config saved to /var/cache/conftool/dbconfig/20221125-141434-marostegui.json [14:20:53] (03PS1) 10David Caro: harbor: ensure that it's started [puppet] - 10https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) [14:24:46] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:24:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:24:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:25:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T323827)', diff saved to https://phabricator.wikimedia.org/P41114 and previous config saved to /var/cache/conftool/dbconfig/20221125-142506-ladsgroup.json [14:25:12] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [14:25:12] (03CR) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [14:25:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:25:19] (03PS5) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 [14:25:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41115 and previous config saved to /var/cache/conftool/dbconfig/20221125-142525-ladsgroup.json [14:36:02] (03CR) 10David Caro: [C: 03+2] wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [14:36:30] (03PS4) 10David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 [14:39:06] (03Merged) 10jenkins-bot: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [14:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323827)', diff saved to 
https://phabricator.wikimedia.org/P41116 and previous config saved to /var/cache/conftool/dbconfig/20221125-144123-ladsgroup.json [14:41:30] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [14:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321126)', diff saved to https://phabricator.wikimedia.org/P41117 and previous config saved to /var/cache/conftool/dbconfig/20221125-144251-marostegui.json [14:42:57] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:56:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41118 and previous config saved to /var/cache/conftool/dbconfig/20221125-145629-ladsgroup.json [14:56:44] (03PS2) 10Jbond: utils/puppet-debugger: add small shell script to run puppet-debugger [puppet] - 10https://gerrit.wikimedia.org/r/860874 [14:57:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41119 and previous config saved to /var/cache/conftool/dbconfig/20221125-145757-marostegui.json [15:01:39] (03CR) 10Elukey: [C: 03+1] Set role_contacts for ml-cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/860854 (owner: 10Muehlenhoff) [15:03:43] (03CR) 10Ssingh: "Thanks for the patch! 4007 is replaced by 4010." 
[puppet] - 10https://gerrit.wikimedia.org/r/860878 (owner: 10Muehlenhoff) [15:04:16] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:05:28] (03PS1) 10Ssingh: hiera: remove lvs4005.yaml (host decommed) [puppet] - 10https://gerrit.wikimedia.org/r/860899 [15:06:16] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:06:46] (03PS2) 10Ssingh: hiera: remove lvs4005.yaml (host decommed) [puppet] - 10https://gerrit.wikimedia.org/r/860899 [15:07:02] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41120 and previous config saved to /var/cache/conftool/dbconfig/20221125-150719-ladsgroup.json [15:07:27] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:09:24] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, nitpick: missing task in commit message" [puppet] - 10https://gerrit.wikimedia.org/r/860899 (owner: 10Ssingh) [15:10:20] (03CR) 10Ssingh: hiera: remove lvs4005.yaml (host decommed) (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/860899 (owner: 10Ssingh) [15:10:30] (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs4005.yaml (host decommed) [puppet] - 10https://gerrit.wikimedia.org/r/860899 (owner: 10Ssingh) [15:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41121 and previous config saved to /var/cache/conftool/dbconfig/20221125-151135-ladsgroup.json [15:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41122 and previous config saved to /var/cache/conftool/dbconfig/20221125-151303-marostegui.json [15:14:00] 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:14:04] (03PS1) 10Volans: turnilo: fix TTFB metric for webrequests datasets [puppet] - 10https://gerrit.wikimedia.org/r/860900 [15:15:22] (03CR) 10Elukey: [C: 03+1] turnilo: fix TTFB metric for webrequests datasets [puppet] - 10https://gerrit.wikimedia.org/r/860900 (owner: 10Volans) [15:16:28] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:16:58] (03CR) 10Btullis: [C: 03+1] turnilo: fix TTFB metric for webrequests datasets [puppet] - 10https://gerrit.wikimedia.org/r/860900 (owner: 10Volans) [15:18:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:21:24] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41123 and previous config saved to /var/cache/conftool/dbconfig/20221125-152225-ladsgroup.json [15:24:01] 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323827)', diff saved to https://phabricator.wikimedia.org/P41124 and previous config saved to /var/cache/conftool/dbconfig/20221125-152642-ladsgroup.json [15:26:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:26:47] 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:26:50] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:26:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41125 and previous config saved to /var/cache/conftool/dbconfig/20221125-152704-ladsgroup.json [15:27:29] 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install/decom eqsin: unified decommission task 
- https://phabricator.wikimedia.org/T323830 (10ssingh) p:05Triage→03Medium [15:27:40] 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:27:44] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [15:27:58] (03CR) 10Volans: [C: 03+2] turnilo: fix TTFB metric for webrequests datasets [puppet] - 10https://gerrit.wikimedia.org/r/860900 (owner: 10Volans) [15:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321126)', diff saved to https://phabricator.wikimedia.org/P41126 and previous config saved to /var/cache/conftool/dbconfig/20221125-152810-marostegui.json [15:28:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:28:16] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:28:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:29:34] jnuche: FYI I've puppet-merged your change on labs/private for keyholder: add new identity for deploy-releng user [15:29:53] sukhe: ok to merge your hiera: remove lvs4005.yaml (host decommed) (4ed14927e7) ? [15:30:41] oh yes please, I had a pending change from someone else [15:30:43] volans: oh, thanks [15:30:48] but go ahead with mine yes [15:30:51] ack [15:30:52] thx [15:30:59] {done} [15:32:08] thanks! 
[15:34:17] (03PS1) 10Jelto: P:spicerack: add python-gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) [15:34:52] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:35:02] (03CR) 10Muehlenhoff: Update lvs-canary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860878 (owner: 10Muehlenhoff) [15:35:38] (03CR) 10Ssingh: [C: 03+1] Update lvs-canary [puppet] - 10https://gerrit.wikimedia.org/r/860878 (owner: 10Muehlenhoff) [15:35:53] (03CR) 10Jelto: "thanks a lot for the detailed review! I answered mostly in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:36:28] (03CR) 10Muehlenhoff: [C: 03+2] Update lvs-canary [puppet] - 10https://gerrit.wikimedia.org/r/860878 (owner: 10Muehlenhoff) [15:36:59] (03CR) 10Muehlenhoff: [C: 03+2] Set role_contacts for ml-cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/860854 (owner: 10Muehlenhoff) [15:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41127 and previous config saved to /var/cache/conftool/dbconfig/20221125-153732-ladsgroup.json [15:42:58] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:48:18] (CR) David Caro: [C: -1] "For a "just works" review, only the addition of the `--long` parameter for the `server_list` is needed as it breaks other cookbooks." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [15:49:10] (PS5) David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/848222 [15:49:23] (CR) David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/848222 (owner: David Caro) [15:50:08] (Abandoned) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro) [15:50:17] (Abandoned) David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [15:50:34] (PS1) Muehlenhoff: openstack/codfw1dev: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860903 (https://phabricator.wikimedia.org/T308013) [15:50:36] (PS1) Muehlenhoff: rsyslog: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860904 (https://phabricator.wikimedia.org/T308013) [15:50:38] (PS1) Muehlenhoff: phabricator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) [15:50:40] (PS1) Muehlenhoff: elasticsearch: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860906 (https://phabricator.wikimedia.org/T308013) [15:50:42] (PS1) Muehlenhoff: zookeeper: Add SPDX headers [puppet] -
https://gerrit.wikimedia.org/r/860907 (https://phabricator.wikimedia.org/T308013) [15:50:44] (PS1) Muehlenhoff: ceph: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860908 (https://phabricator.wikimedia.org/T308013) [15:50:46] (PS1) Muehlenhoff: puppetmaster: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860909 (https://phabricator.wikimedia.org/T308013) [15:50:48] (PS1) Muehlenhoff: graphite: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860910 (https://phabricator.wikimedia.org/T308013) [15:50:50] (PS1) Muehlenhoff: mariadb::proxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860911 (https://phabricator.wikimedia.org/T308013) [15:50:52] (PS1) Muehlenhoff: Add SPDX headers for various IF profiles [puppet] - https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) [15:50:54] (PS1) Muehlenhoff: mariadb: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860913 (https://phabricator.wikimedia.org/T308013) [15:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41128 and previous config saved to /var/cache/conftool/dbconfig/20221125-155238-ladsgroup.json [15:52:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [15:52:45] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:52:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [15:53:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41129 and previous config saved to /var/cache/conftool/dbconfig/20221125-155300-ladsgroup.json [15:54:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for
5:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:54:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:54:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41130 and previous config saved to /var/cache/conftool/dbconfig/20221125-155447-marostegui.json [15:54:53] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:55:40] (PS2) David Caro: p::metricsinfra:haproxy: rename some vars to reflect intent [puppet] - https://gerrit.wikimedia.org/r/831036 [16:00:27] (CR) David Caro: [C: +2] create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/848222 (owner: David Caro) [16:05:34] (PS19) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 [16:05:36] (PS1) Arturo Borrero Gonzalez: wmcs: openstack: common: allow to list servers with extra information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 [16:07:52] (CR) David Caro: [C: +1] "LGTM, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/860908 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [16:07:52] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41131 and previous config saved to /var/cache/conftool/dbconfig/20221125-160755-ladsgroup.json [16:08:03] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:11:50] <_joe_> !log upgraded vopsbot to 0.3.2 [16:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:40] (Merged) jenkins-bot: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/848222 (owner: David Caro) [16:22:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41132 and previous config saved to /var/cache/conftool/dbconfig/20221125-162251-marostegui.json [16:22:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41133 and previous config saved to /var/cache/conftool/dbconfig/20221125-162302-ladsgroup.json [16:24:33] Is it just me, or is Zuul/Jenkins "stuck"?
[16:24:52] !log restarted turnilo on an-tool1007 [16:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:30] (CR) Elukey: [C: +1] zookeeper: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860907 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [16:30:18] _joe_: ^ looks like sirenbot decided to apply social distancing between the channel logs link and the clinic duty person [16:31:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41134 and previous config saved to /var/cache/conftool/dbconfig/20221125-163147-ladsgroup.json [16:31:54] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:33:11] (CR) CI reject: [V: -1] wmcs: openstack: common: allow to list servers with extra information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 (owner: Arturo Borrero Gonzalez) [16:33:28] (CR) CI reject: [V: -1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [16:34:14] <_joe_> taavi: sigh yeah my bad [16:34:24] <_joe_> I'll fix :) [16:34:42] thanks!
[16:37:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41135 and previous config saved to /var/cache/conftool/dbconfig/20221125-163758-marostegui.json [16:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41136 and previous config saved to /var/cache/conftool/dbconfig/20221125-163808-ladsgroup.json [16:46:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41137 and previous config saved to /var/cache/conftool/dbconfig/20221125-164654-ladsgroup.json [16:49:32] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@f6b8a0a]: (no justification provided) [16:49:51] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@f6b8a0a]: (no justification provided) (duration: 00m 18s) [16:53:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41138 and previous config saved to /var/cache/conftool/dbconfig/20221125-165304-marostegui.json [16:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41139 and previous config saved to /var/cache/conftool/dbconfig/20221125-165315-ladsgroup.json [16:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:53:21] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:53:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:53:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on db2095.codfw.wmnet with reason: Maintenance 
[16:53:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:53:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41140 and previous config saved to /var/cache/conftool/dbconfig/20221125-165341-ladsgroup.json [16:58:15] (PS2) Arturo Borrero Gonzalez: wmcs: openstack: common: allow to list servers with extra information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 [16:58:17] (PS1) Arturo Borrero Gonzalez: wmcs: openstack: inventory: add support to network information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860924 [16:59:00] <_joe_> taavi: again thanks for noticing, the bug is fixed [17:02:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41141 and previous config saved to /var/cache/conftool/dbconfig/20221125-170200-ladsgroup.json [17:08:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41142 and previous config saved to /var/cache/conftool/dbconfig/20221125-170811-marostegui.json [17:08:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:08:17] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:08:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:08:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:08:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance
[17:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T321126)', diff saved to https://phabricator.wikimedia.org/P41143 and previous config saved to /var/cache/conftool/dbconfig/20221125-170859-marostegui.json [17:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41144 and previous config saved to /var/cache/conftool/dbconfig/20221125-171032-ladsgroup.json [17:10:39] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:17:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41145 and previous config saved to /var/cache/conftool/dbconfig/20221125-171707-ladsgroup.json [17:17:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:17:15] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:17:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:17:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41146 and previous config saved to /var/cache/conftool/dbconfig/20221125-171729-ladsgroup.json [17:20:14] (PS3) Arturo Borrero Gonzalez: wmcs: openstack: common: allow to list servers with extra information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 [17:20:16] (PS2) Arturo Borrero Gonzalez: wmcs: openstack: inventory: add support to network information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860924 [17:20:18] (PS20) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[17:21:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:21:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:23:18] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:46] (CR) CI reject: [V: -1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [17:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41147 and previous config saved to /var/cache/conftool/dbconfig/20221125-172538-ladsgroup.json [17:29:09] (PS1) Elukey: Add basic rate-limit capabilities to ML clusters [deployment-charts] - https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) [17:32:01] (CR) Elukey: Add basic rate-limit capabilities to ML clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) (owner: Elukey) [17:33:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41148 and previous config saved to /var/cache/conftool/dbconfig/20221125-173340-ladsgroup.json [17:33:47] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321126)', diff saved to https://phabricator.wikimedia.org/P41149 and previous config saved to /var/cache/conftool/dbconfig/20221125-173545-marostegui.json [17:35:52] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis -
https://phabricator.wikimedia.org/T321126 [17:38:35] !log initiating Cassandra bootstrap, aqs1021-a -- T307802 [17:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:41] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [17:39:28] RECOVERY - cassandra-a service on aqs1021 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:40:16] RECOVERY - cassandra-a SSL 10.64.135.14:7001 on aqs1021 is OK: SSL OK - Certificate aqs1021-a valid until 2024-11-08 15:06:40 +0000 (expires in 713 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41150 and previous config saved to /var/cache/conftool/dbconfig/20221125-174045-ladsgroup.json [17:48:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41151 and previous config saved to /var/cache/conftool/dbconfig/20221125-174847-ladsgroup.json [17:49:28] (PS1) Ssingh: P:durum: add a note for users with JavaScript disabled [puppet] - https://gerrit.wikimedia.org/r/860928 [17:50:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:50:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41152 and previous config saved to /var/cache/conftool/dbconfig/20221125-175052-marostegui.json [17:50:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with
reason: Maintenance [17:51:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:51:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T323827)', diff saved to https://phabricator.wikimedia.org/P41153 and previous config saved to /var/cache/conftool/dbconfig/20221125-175114-ladsgroup.json [17:51:20] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:51:30] (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38445/console" [puppet] - https://gerrit.wikimedia.org/r/860928 (owner: Ssingh) [17:52:22] (CR) Ssingh: [V: +1 C: +2] P:durum: add a note for users with JavaScript disabled [puppet] - https://gerrit.wikimedia.org/r/860928 (owner: Ssingh) [17:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41154 and previous config saved to /var/cache/conftool/dbconfig/20221125-175551-ladsgroup.json [17:55:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:56:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:56:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41155 and previous config saved to /var/cache/conftool/dbconfig/20221125-175624-ladsgroup.json [17:56:30] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:03:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41156 and previous config saved to
/var/cache/conftool/dbconfig/20221125-180353-ladsgroup.json [18:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41157 and previous config saved to /var/cache/conftool/dbconfig/20221125-180558-marostegui.json [18:06:34] (PS1) Ssingh: hiera: unify ulsfo LVS configuration [puppet] - https://gerrit.wikimedia.org/r/860930 (https://phabricator.wikimedia.org/T317247) [18:07:28] (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38446/console" [puppet] - https://gerrit.wikimedia.org/r/860930 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh) [18:07:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323827)', diff saved to https://phabricator.wikimedia.org/P41158 and previous config saved to /var/cache/conftool/dbconfig/20221125-180753-ladsgroup.json [18:07:59] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:17:20] (CR) Filippo Giunchedi: [C: +1] rsyslog: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860904 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:17:27] (CR) Filippo Giunchedi: [C: +1] graphite: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860910 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41159 and previous config saved to /var/cache/conftool/dbconfig/20221125-181900-ladsgroup.json [18:19:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:19:07] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:19:15] !log
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321126)', diff saved to https://phabricator.wikimedia.org/P41160 and previous config saved to /var/cache/conftool/dbconfig/20221125-182105-marostegui.json [18:21:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:21:11] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:21:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T321126)', diff saved to https://phabricator.wikimedia.org/P41161 and previous config saved to /var/cache/conftool/dbconfig/20221125-182126-marostegui.json [18:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41162 and previous config saved to /var/cache/conftool/dbconfig/20221125-182259-ladsgroup.json [18:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41163 and previous config saved to /var/cache/conftool/dbconfig/20221125-183356-ladsgroup.json [18:34:03] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:37:28] (CR) Gergő Tisza: [C: +1] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: Kosta Harlan) [18:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to
https://phabricator.wikimedia.org/P41164 and previous config saved to /var/cache/conftool/dbconfig/20221125-183806-ladsgroup.json [18:49:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41165 and previous config saved to /var/cache/conftool/dbconfig/20221125-184902-ladsgroup.json [18:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321126)', diff saved to https://phabricator.wikimedia.org/P41166 and previous config saved to /var/cache/conftool/dbconfig/20221125-184943-marostegui.json [18:49:49] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:52:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:52:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41167 and previous config saved to /var/cache/conftool/dbconfig/20221125-185257-ladsgroup.json [18:53:03] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:53:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323827)', diff saved to https://phabricator.wikimedia.org/P41168 and previous config saved to /var/cache/conftool/dbconfig/20221125-185312-ladsgroup.json [18:53:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [18:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41169 and previous config saved to /var/cache/conftool/dbconfig/20221125-190409-ladsgroup.json [19:04:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41170 and previous config saved to /var/cache/conftool/dbconfig/20221125-190450-marostegui.json [19:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41171 and previous config saved to /var/cache/conftool/dbconfig/20221125-191915-ladsgroup.json [19:19:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:19:23] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:19:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41172 and previous config saved to /var/cache/conftool/dbconfig/20221125-191937-ladsgroup.json [19:19:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41173 and previous config saved to /var/cache/conftool/dbconfig/20221125-191956-marostegui.json [19:21:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:21:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T323827)', diff saved to https://phabricator.wikimedia.org/P41174 and previous config saved 
to /var/cache/conftool/dbconfig/20221125-192147-ladsgroup.json [19:25:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323827)', diff saved to https://phabricator.wikimedia.org/P41175 and previous config saved to /var/cache/conftool/dbconfig/20221125-192530-ladsgroup.json [19:25:36] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:26:32] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41176 and previous config saved to /var/cache/conftool/dbconfig/20221125-193145-ladsgroup.json [19:31:52] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:33:16] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (andrea.denisse) Hi @VirginiaPoundstone, were you able to access https://datahub.wikimedia.org/ in the past? What error do you get when trying to log in?
[19:35:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321126)', diff saved to https://phabricator.wikimedia.org/P41177 and previous config saved to /var/cache/conftool/dbconfig/20221125-193503-marostegui.json [19:35:10] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:40:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41178 and previous config saved to /var/cache/conftool/dbconfig/20221125-194036-ladsgroup.json [19:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41179 and previous config saved to /var/cache/conftool/dbconfig/20221125-194652-ladsgroup.json [19:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41180 and previous config saved to /var/cache/conftool/dbconfig/20221125-195543-ladsgroup.json [19:56:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41181 and previous config saved to /var/cache/conftool/dbconfig/20221125-195652-ladsgroup.json [19:56:58] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:01:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41182 and previous config saved to /var/cache/conftool/dbconfig/20221125-200158-ladsgroup.json [20:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323827)', diff saved to https://phabricator.wikimedia.org/P41183 and previous config saved to /var/cache/conftool/dbconfig/20221125-201049-ladsgroup.json [20:10:51] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [20:10:56] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:11:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [20:11:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T323827)', diff saved to https://phabricator.wikimedia.org/P41184 and previous config saved to /var/cache/conftool/dbconfig/20221125-201111-ladsgroup.json [20:11:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41185 and previous config saved to /var/cache/conftool/dbconfig/20221125-201158-ladsgroup.json [20:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41186 and previous config saved to /var/cache/conftool/dbconfig/20221125-201705-ladsgroup.json [20:17:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:17:12] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:17:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:17:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:17:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:17:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T323827)', diff saved to 
https://phabricator.wikimedia.org/P41187 and previous config saved to /var/cache/conftool/dbconfig/20221125-201754-ladsgroup.json [20:22:52] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323827)', diff saved to https://phabricator.wikimedia.org/P41188 and previous config saved to /var/cache/conftool/dbconfig/20221125-202557-ladsgroup.json [20:26:04] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:27:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41189 and previous config saved to /var/cache/conftool/dbconfig/20221125-202705-ladsgroup.json [20:28:01] SRE, LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (andrea.denisse) Hi @dom_walden, according to [[ https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_WMF/WMDE_Staff_to_an_access_group | our procedures ]] in order to add you to a grou...
[20:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41190 and previous config saved to /var/cache/conftool/dbconfig/20221125-204103-ladsgroup.json [20:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41191 and previous config saved to /var/cache/conftool/dbconfig/20221125-204211-ladsgroup.json [20:42:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:42:18] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:42:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:42:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41192 and previous config saved to /var/cache/conftool/dbconfig/20221125-204244-ladsgroup.json [20:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41193 and previous config saved to 
/var/cache/conftool/dbconfig/20221125-205457-ladsgroup.json [20:55:04] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41194 and previous config saved to /var/cache/conftool/dbconfig/20221125-205609-ladsgroup.json [21:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P41195 and previous config saved to /var/cache/conftool/dbconfig/20221125-211003-ladsgroup.json [21:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323827)', diff saved to https://phabricator.wikimedia.org/P41196 and previous config saved to /var/cache/conftool/dbconfig/20221125-211116-ladsgroup.json [21:11:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:11:23] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [21:11:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41197 and previous config saved to /var/cache/conftool/dbconfig/20221125-211137-ladsgroup.json [21:20:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41198 and previous config saved to /var/cache/conftool/dbconfig/20221125-212020-ladsgroup.json [21:20:27] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [21:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff 
saved to https://phabricator.wikimedia.org/P41199 and previous config saved to /var/cache/conftool/dbconfig/20221125-212510-ladsgroup.json [21:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41200 and previous config saved to /var/cache/conftool/dbconfig/20221125-212638-ladsgroup.json [21:26:45] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [21:35:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P41201 and previous config saved to /var/cache/conftool/dbconfig/20221125-213527-ladsgroup.json [21:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41202 and previous config saved to /var/cache/conftool/dbconfig/20221125-214016-ladsgroup.json [21:40:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:40:23] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [21:40:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:40:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41203 and previous config saved to /var/cache/conftool/dbconfig/20221125-214038-ladsgroup.json [21:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41204 and previous config saved to /var/cache/conftool/dbconfig/20221125-214144-ladsgroup.json [21:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to 
https://phabricator.wikimedia.org/P41205 and previous config saved to /var/cache/conftool/dbconfig/20221125-215034-ladsgroup.json [21:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41206 and previous config saved to /var/cache/conftool/dbconfig/20221125-215651-ladsgroup.json [22:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41207 and previous config saved to /var/cache/conftool/dbconfig/20221125-220541-ladsgroup.json [22:05:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2175.codfw.wmnet with reason: Maintenance [22:05:48] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [22:05:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2175.codfw.wmnet with reason: Maintenance [22:06:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41208 and previous config saved to /var/cache/conftool/dbconfig/20221125-220602-ladsgroup.json [22:07:40] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41209 and previous config saved to /var/cache/conftool/dbconfig/20221125-221157-ladsgroup.json [22:11:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [22:12:06] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [22:12:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [22:12:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T323827)', diff saved to https://phabricator.wikimedia.org/P41210 and previous config saved to /var/cache/conftool/dbconfig/20221125-221218-ladsgroup.json [22:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323827)', diff saved to https://phabricator.wikimedia.org/P41211 and previous config saved to /var/cache/conftool/dbconfig/20221125-221602-ladsgroup.json [22:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41212 and previous config saved to /var/cache/conftool/dbconfig/20221125-221938-ladsgroup.json [22:19:44] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [22:22:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:27:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41213 and previous config saved to /var/cache/conftool/dbconfig/20221125-223109-ladsgroup.json [22:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P41214 and 
previous config saved to /var/cache/conftool/dbconfig/20221125-223444-ladsgroup.json [22:44:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41215 and previous config saved to /var/cache/conftool/dbconfig/20221125-224443-ladsgroup.json [22:44:50] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [22:46:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41216 and previous config saved to /var/cache/conftool/dbconfig/20221125-224615-ladsgroup.json [22:49:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P41217 and previous config saved to /var/cache/conftool/dbconfig/20221125-224951-ladsgroup.json [22:51:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P41218 and previous config saved to /var/cache/conftool/dbconfig/20221125-225949-ladsgroup.json [23:01:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323827)', diff saved to https://phabricator.wikimedia.org/P41219 and previous config saved to /var/cache/conftool/dbconfig/20221125-230122-ladsgroup.json [23:01:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1189.eqiad.wmnet with reason: Maintenance [23:01:28] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [23:01:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1189.eqiad.wmnet with reason: Maintenance [23:01:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T323827)', diff saved to https://phabricator.wikimedia.org/P41220 and previous config saved to /var/cache/conftool/dbconfig/20221125-230143-ladsgroup.json [23:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41221 and previous config saved to /var/cache/conftool/dbconfig/20221125-230457-ladsgroup.json [23:04:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [23:05:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [23:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T323827)', diff saved to https://phabricator.wikimedia.org/P41222 and previous config saved to /var/cache/conftool/dbconfig/20221125-230518-ladsgroup.json [23:14:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P41223 and previous config saved to /var/cache/conftool/dbconfig/20221125-231456-ladsgroup.json [23:17:26] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41224 and previous config saved to /var/cache/conftool/dbconfig/20221125-233002-ladsgroup.json [23:30:10] T323827: Finish timestamp schema changes in flaggedrevs - 
https://phabricator.wikimedia.org/T323827 [23:30:20] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:32:16] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:42:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [23:42:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [23:42:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:42:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T323827)', diff saved to https://phabricator.wikimedia.org/P41225 and previous config saved to /var/cache/conftool/dbconfig/20221125-234305-ladsgroup.json [23:43:13] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [23:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323827)', diff saved to https://phabricator.wikimedia.org/P41226 and previous config saved to /var/cache/conftool/dbconfig/20221125-234428-ladsgroup.json [23:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323827)', diff saved to https://phabricator.wikimedia.org/P41227 and previous config saved to /var/cache/conftool/dbconfig/20221125-234836-ladsgroup.json [23:48:43] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [23:59:19] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323827)', diff saved to https://phabricator.wikimedia.org/P41228 and previous config saved to /var/cache/conftool/dbconfig/20221125-235919-ladsgroup.json [23:59:26] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [23:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P41229 and previous config saved to /var/cache/conftool/dbconfig/20221125-235935-ladsgroup.json