[00:03:38] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (andrea.denisse) Hello @Gehel , do you approve @Dcausse access to the `analytics-admins` group ? [00:05:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40880 and previous config saved to /var/cache/conftool/dbconfig/20221124-000543-marostegui.json [00:09:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P40881 and previous config saved to /var/cache/conftool/dbconfig/20221124-001435-ladsgroup.json [00:14:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:19:39] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [00:20:07] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40882 and previous config saved to /var/cache/conftool/dbconfig/20221124-002050-marostegui.json [00:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:29:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P40883 and previous config saved to /var/cache/conftool/dbconfig/20221124-002941-ladsgroup.json [00:30:17] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321126)', diff saved to https://phabricator.wikimedia.org/P40884 and previous config saved to /var/cache/conftool/dbconfig/20221124-003556-marostegui.json [00:35:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:36:03] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [00:36:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:36:18] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depooling db2168:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40885 and previous config saved to /var/cache/conftool/dbconfig/20221124-003618-marostegui.json [00:36:54] (CR) Cwhite: [C: +1] Remove graphite2003 [puppet] - https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718) (owner: Filippo Giunchedi) [00:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40886 and previous config saved to /var/cache/conftool/dbconfig/20221124-003850-marostegui.json [00:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:39:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [00:39:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [00:39:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:40:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:40:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T323214)', diff saved to https://phabricator.wikimedia.org/P40887 and previous config saved to /var/cache/conftool/dbconfig/20221124-004006-ladsgroup.json [00:40:12] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:43:47] RECOVERY - Check systemd state on logstash1026 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40888 and previous config saved to /var/cache/conftool/dbconfig/20221124-004448-ladsgroup.json [00:44:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:45:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40889 and previous config saved to /var/cache/conftool/dbconfig/20221124-004510-ladsgroup.json [00:45:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:50:41] PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:53:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40890 and previous config saved to /var/cache/conftool/dbconfig/20221124-005357-marostegui.json [00:54:58] (PS1) Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [00:55:35] (PS2) Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [00:55:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:55:59] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [00:58:01] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [00:59:21] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (andrea.denisse) Hi @Htriedman and @Jcross , could you please help me to confirm that the expiry date for @dasm 's access is on the 2023-06-30? :) [01:00:34] (CR) Andrea Denisse: "Hello, could you please review my patch?" 
[puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: Andrea Denisse) [01:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:08:29] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (andrea.denisse) Hi @Ottomata , I just want to double check with you, Wenjun's access is ssh-less access to analytics-privatedata-users group, right? If so, to remove thei... [01:09:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40891 and previous config saved to /var/cache/conftool/dbconfig/20221124-010903-marostegui.json [01:09:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (andrea.denisse) Hi @XenoRyet , do you approve ssh-less access to the `analytics-privatedata-users` group for Wenjun Fan ? [01:10:59] (CR) RLazarus: [C: +1] "Mostly LGTM, see below. 
:) You might want to wait until your expiry_date question is answered on Phab, but feel free to edit and merge wit" [puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: Andrea Denisse) [01:11:13] denisse|m: ^ hope you don't mind the drive-by :) [01:13:50] (PS3) Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) [01:17:40] (CR) Andrea Denisse: admin: add dasm to analytics-privatedata-users (3 comments) [puppet] - https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: Andrea Denisse) [01:18:03] rzl: On the contrary, thank you!! :D [01:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40892 and previous config saved to /var/cache/conftool/dbconfig/20221124-012409-marostegui.json [01:24:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:24:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:24:17] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [01:24:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40893 and previous config saved to /var/cache/conftool/dbconfig/20221124-012420-marostegui.json [01:26:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40894 and previous config saved to /var/cache/conftool/dbconfig/20221124-012652-marostegui.json [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40895 and previous config saved to /var/cache/conftool/dbconfig/20221124-014158-marostegui.json [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:45] (PS1) RLazarus: httpbb: Bump the timeout for meta:List_of_Wikipedias, at least for now [puppet] - https://gerrit.wikimedia.org/r/860136 (https://phabricator.wikimedia.org/T323707) [01:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T323214)', diff saved to https://phabricator.wikimedia.org/P40896 and previous config saved to /var/cache/conftool/dbconfig/20221124-014908-ladsgroup.json [01:49:15] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [01:51:29] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [01:51:35] RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:52:27] (CR) RLazarus: [C: +2] httpbb: Bump the timeout for meta:List_of_Wikipedias, at least for now [puppet] - https://gerrit.wikimedia.org/r/860136 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus) [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:31] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:57:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40897 and previous config saved to /var/cache/conftool/dbconfig/20221124-015705-marostegui.json [02:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P40898 and previous config saved to /var/cache/conftool/dbconfig/20221124-020415-ladsgroup.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40899 and previous config saved to /var/cache/conftool/dbconfig/20221124-021211-marostegui.json [02:12:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2182.codfw.wmnet with reason: Maintenance [02:12:18] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [02:12:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2182.codfw.wmnet with reason: Maintenance [02:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T321126)', diff saved to https://phabricator.wikimedia.org/P40900 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-021233-marostegui.json [02:15:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321126)', diff saved to https://phabricator.wikimedia.org/P40901 and previous config saved to /var/cache/conftool/dbconfig/20221124-021505-marostegui.json [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P40902 and previous config saved to /var/cache/conftool/dbconfig/20221124-021921-ladsgroup.json [02:23:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40903 and previous config saved to /var/cache/conftool/dbconfig/20221124-022309-ladsgroup.json [02:23:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:30:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40904 and previous config saved to /var/cache/conftool/dbconfig/20221124-023011-marostegui.json [02:34:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T323214)', diff saved to https://phabricator.wikimedia.org/P40905 and previous config saved to /var/cache/conftool/dbconfig/20221124-023428-ladsgroup.json [02:34:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [02:34:34] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:34:54] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [02:35:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T323214)', diff saved to https://phabricator.wikimedia.org/P40906 and previous config saved to /var/cache/conftool/dbconfig/20221124-023500-ladsgroup.json [02:38:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P40907 and previous config saved to /var/cache/conftool/dbconfig/20221124-023816-ladsgroup.json [02:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:45:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40908 and previous config saved to /var/cache/conftool/dbconfig/20221124-024518-marostegui.json [02:53:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P40909 and previous config saved to /var/cache/conftool/dbconfig/20221124-025322-ladsgroup.json [03:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321126)', diff saved to https://phabricator.wikimedia.org/P40910 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-030025-marostegui.json [03:00:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [03:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40911 and previous config saved to /var/cache/conftool/dbconfig/20221124-030829-ladsgroup.json [03:08:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [03:08:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [03:08:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [03:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40912 and previous config saved to /var/cache/conftool/dbconfig/20221124-030901-ladsgroup.json [03:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T323214)', diff saved to https://phabricator.wikimedia.org/P40913 and previous config saved to /var/cache/conftool/dbconfig/20221124-034217-ladsgroup.json [03:42:23] T323214: Fix unsigned drifts in flaggedrevs caused 
by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [03:55:28] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P40914 and previous config saved to /var/cache/conftool/dbconfig/20221124-035723-ladsgroup.json [04:02:43] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P40915 and previous config saved to /var/cache/conftool/dbconfig/20221124-041230-ladsgroup.json [04:27:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T323214)', diff saved to https://phabricator.wikimedia.org/P40916 and previous config saved to /var/cache/conftool/dbconfig/20221124-042736-ladsgroup.json [04:27:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [04:27:43] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [04:27:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [04:27:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T323214)', diff saved to https://phabricator.wikimedia.org/P40917 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-042757-ladsgroup.json [04:42:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40918 and previous config saved to /var/cache/conftool/dbconfig/20221124-044249-ladsgroup.json [04:42:56] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [04:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P40919 and previous config saved to /var/cache/conftool/dbconfig/20221124-045755-ladsgroup.json [05:13:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P40920 and previous config saved to /var/cache/conftool/dbconfig/20221124-051301-ladsgroup.json [05:16:07] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:17:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T323214)', diff saved to https://phabricator.wikimedia.org/P40921 and previous config saved to /var/cache/conftool/dbconfig/20221124-051749-ladsgroup.json [05:17:56] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40922 and previous config saved to /var/cache/conftool/dbconfig/20221124-052808-ladsgroup.json [05:28:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:28:15] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:28:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T323214)', diff saved to https://phabricator.wikimedia.org/P40923 and previous config saved to /var/cache/conftool/dbconfig/20221124-052830-ladsgroup.json [05:32:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P40924 and previous config saved to /var/cache/conftool/dbconfig/20221124-053256-ladsgroup.json [05:48:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P40925 and previous config saved to /var/cache/conftool/dbconfig/20221124-054802-ladsgroup.json [06:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T323214)', diff saved to 
https://phabricator.wikimedia.org/P40926 and previous config saved to /var/cache/conftool/dbconfig/20221124-060309-ladsgroup.json [06:03:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [06:03:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [06:03:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [06:03:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T323214)', diff saved to https://phabricator.wikimedia.org/P40927 and previous config saved to /var/cache/conftool/dbconfig/20221124-060330-ladsgroup.json [06:06:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T323117 [06:06:21] T323117: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T323117 [06:06:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T323117 [06:07:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1136 with weight 0 T323117', diff saved to https://phabricator.wikimedia.org/P40928 and previous config saved to /var/cache/conftool/dbconfig/20221124-060742-ladsgroup.json [06:21:24] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (Gehel) I approve! 
[06:28:55] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 249 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:29:52] (PS2) Giuseppe Lavagetto: citoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859487 [06:30:59] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T323214)', diff saved to https://phabricator.wikimedia.org/P40929 and previous config saved to /var/cache/conftool/dbconfig/20221124-065057-ladsgroup.json [06:51:04] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [06:52:22] (CR) Giuseppe Lavagetto: [C: +2] citoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859487 (owner: Giuseppe Lavagetto) [06:56:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:56:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:56:47] (Merged) jenkins-bot: citoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859487 (owner: Giuseppe Lavagetto) [06:58:16] SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) Given that it is a public holiday in the US and Papaul won't be onsite till Monday, I am starting replication so the host doesn't get behind that many days. 
I will stop it again on Monday. [06:59:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T323214)', diff saved to https://phabricator.wikimedia.org/P40930 and previous config saved to /var/cache/conftool/dbconfig/20221124-065956-ladsgroup.json [07:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T0700). Please do the needful. [07:00:06] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:00:14] moin [07:00:17] starting [07:01:11] (PS2) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/856498 (https://phabricator.wikimedia.org/T323117) (owner: Gerrit maintenance bot) [07:01:15] (CR) Ladsgroup: [C: +2] mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/856498 (https://phabricator.wikimedia.org/T323117) (owner: Gerrit maintenance bot) [07:01:18] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/856498 (https://phabricator.wikimedia.org/T323117) (owner: Gerrit maintenance bot) [07:02:04] !log Starting s7 eqiad failover from db1181 to db1136 - T323117 [07:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:10] T323117: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T323117 [07:02:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T323117', diff saved to https://phabricator.wikimedia.org/P40931 and previous config saved to /var/cache/conftool/dbconfig/20221124-070215-ladsgroup.json [07:02:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:02:35] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 5:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1136 to s7 primary and set section read-write T323117', diff saved to https://phabricator.wikimedia.org/P40932 and previous config saved to /var/cache/conftool/dbconfig/20221124-070250-ladsgroup.json
[07:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:03:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:03:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:04:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:04:26] (PS2) Ladsgroup: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/856499 (https://phabricator.wikimedia.org/T323117) (owner: Gerrit maintenance bot)
[07:04:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:04:35] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/856499 (https://phabricator.wikimedia.org/T323117) (owner: Gerrit maintenance bot)
[07:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40933 and previous config saved to /var/cache/conftool/dbconfig/20221124-070437-marostegui.json
[07:04:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis -
https://phabricator.wikimedia.org/T321126
[07:05:00] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[07:05:16] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[07:05:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1181 T323117', diff saved to https://phabricator.wikimedia.org/P40934 and previous config saved to /var/cache/conftool/dbconfig/20221124-070546-ladsgroup.json
[07:05:47] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[07:06:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P40935 and previous config saved to /var/cache/conftool/dbconfig/20221124-070603-ladsgroup.json
[07:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40936 and previous config saved to /var/cache/conftool/dbconfig/20221124-070645-marostegui.json
[07:06:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:06:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:07:45] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[07:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:08:08] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[07:09:21] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[07:09:46] !log
oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[07:12:55] RECOVERY - swift eqiad object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad
[07:14:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:14:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[07:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P40938 and previous config saved to /var/cache/conftool/dbconfig/20221124-071504-ladsgroup.json
[07:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:21:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P40939 and previous config saved to /var/cache/conftool/dbconfig/20221124-072110-ladsgroup.json
[07:21:46] (PS2) Stang: wikidatawiki: Add language-specific logos [mediawiki-config] - https://gerrit.wikimedia.org/r/860117 (https://phabricator.wikimedia.org/T323734)
[07:21:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P40940 and previous config saved to /var/cache/conftool/dbconfig/20221124-072152-marostegui.json
[07:28:36] (PS1) Volans: ulsfo mgmt: remove missing netbox include [dns] - https://gerrit.wikimedia.org/r/860474
[07:30:11] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P40941 and previous config saved to /var/cache/conftool/dbconfig/20221124-073011-ladsgroup.json
[07:30:35] !log phedenskog@deploy1002 Started deploy [performance/navtiming@e421904]: (no justification provided)
[07:30:40] (CR) Volans: [C: +2] ulsfo mgmt: remove missing netbox include [dns] - https://gerrit.wikimedia.org/r/860474 (owner: Volans)
[07:30:44] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@e421904]: (no justification provided) (duration: 00m 08s)
[07:35:30] (PS2) Giuseppe Lavagetto: cxserver: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859488
[07:36:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T323214)', diff saved to https://phabricator.wikimedia.org/P40942 and previous config saved to /var/cache/conftool/dbconfig/20221124-073616-ladsgroup.json
[07:36:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[07:36:23] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[07:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[07:36:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T323214)', diff saved to https://phabricator.wikimedia.org/P40943 and previous config saved to /var/cache/conftool/dbconfig/20221124-073637-ladsgroup.json
[07:36:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P40944 and previous config saved to /var/cache/conftool/dbconfig/20221124-073658-marostegui.json
[07:41:47] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver
code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[07:42:59] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[07:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T323214)', diff saved to https://phabricator.wikimedia.org/P40945 and previous config saved to /var/cache/conftool/dbconfig/20221124-074517-ladsgroup.json
[07:45:24] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[07:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40946 and previous config saved to /var/cache/conftool/dbconfig/20221124-075205-marostegui.json
[07:52:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:52:12] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:52:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40947 and previous config saved to
/var/cache/conftool/dbconfig/20221124-075226-marostegui.json
[07:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40948 and previous config saved to /var/cache/conftool/dbconfig/20221124-075434-marostegui.json
[07:57:02] (PS1) Marostegui: control-mariadb-client-10.5: Delete file [software] - https://gerrit.wikimedia.org/r/860477
[07:57:50] (CR) Marostegui: [C: +2] control-mariadb-client-10.5: Delete file [software] - https://gerrit.wikimedia.org/r/860477 (owner: Marostegui)
[07:58:20] (Merged) jenkins-bot: control-mariadb-client-10.5: Delete file [software] - https://gerrit.wikimedia.org/r/860477 (owner: Marostegui)
[08:00:05] Amir1, apergos, and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T0800).
[08:00:16] morning! there are no trainees signed up this morning and no patches scheduled for deployment in the window.
[08:00:38] and this means.... you guessed it... see you next time! and have a happy holiday, folks in the U.S.
[08:04:43] !log rebalance Ganeti group A/codfw following reboots
[08:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P40949 and previous config saved to /var/cache/conftool/dbconfig/20221124-080941-marostegui.json
[08:13:48] !log installing tomcat9 security updates
[08:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:38] (PS24) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569)
[08:24:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P40950 and previous config saved to /var/cache/conftool/dbconfig/20221124-082447-marostegui.json
[08:24:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T323214)', diff saved to https://phabricator.wikimedia.org/P40951 and previous config saved to /var/cache/conftool/dbconfig/20221124-082458-ladsgroup.json
[08:25:04] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[08:26:59] (CR) CI reject: [V: -1] sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[08:30:31] (PS25) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569)
[08:34:42] (CR) CI reject: [V: -1] sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[08:39:54] !log marostegui@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P40952 and previous config saved to /var/cache/conftool/dbconfig/20221124-083954-marostegui.json
[08:39:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[08:40:01] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:40:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P40953 and previous config saved to /var/cache/conftool/dbconfig/20221124-084004-ladsgroup.json
[08:40:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[08:40:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T321126)', diff saved to https://phabricator.wikimedia.org/P40954 and previous config saved to /var/cache/conftool/dbconfig/20221124-084015-marostegui.json
[08:41:24] Hello, I'm unsure where I should ask this, so be free to correct me. :)
[08:41:33] I'm unable to extract archive file from https://dumps.wikimedia.org/wikidatawiki/20221120/
[08:41:41] ubuntu@ip-172-31-41-196:~$ tar xf wikidatawiki-20221120-pages-articles.xml.bz2
[08:41:41] tar (child): lbzip2: Cannot exec: No such file or directory
[08:41:42] tar (child): Error is not recoverable: exiting now
[08:41:42] tar: Child returned status 2
[08:41:43] tar: Error is not recoverable: exiting now
[08:42:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321126)', diff saved to https://phabricator.wikimedia.org/P40955 and previous config saved to /var/cache/conftool/dbconfig/20221124-084223-marostegui.json
[08:46:07] Nevermind. Gotcha.
This has solved issue for me: https://svennd.be/lbzip2-cannot-exec-no-such-file-or-directory/
[08:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P40956 and previous config saved to /var/cache/conftool/dbconfig/20221124-085511-ladsgroup.json
[08:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P40957 and previous config saved to /var/cache/conftool/dbconfig/20221124-085729-marostegui.json
[09:03:36] (CR) Jelto: "This change is ready for review." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T323214)', diff saved to https://phabricator.wikimedia.org/P40958 and previous config saved to /var/cache/conftool/dbconfig/20221124-091017-ladsgroup.json
[09:10:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[09:10:25] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[09:10:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[09:11:21] (CR) Giuseppe Lavagetto: [C: +2] cxserver: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859488 (owner: Giuseppe Lavagetto)
[09:12:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P40959 and previous config saved to /var/cache/conftool/dbconfig/20221124-091236-marostegui.json
[09:15:54] (Merged) jenkins-bot: cxserver: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859488 (owner:
Giuseppe Lavagetto)
[09:17:16] (CR) Filippo Giunchedi: [C: +2] Remove graphite2003 [puppet] - https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718) (owner: Filippo Giunchedi)
[09:17:21] (PS2) Filippo Giunchedi: Remove graphite2003 [puppet] - https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718)
[09:20:30] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:22:04] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:23:36] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[09:24:55] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[09:26:04] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:26:45] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[09:27:14] (PS1) Slyngshede: Allow configuration from json file.
[software/bitu-ldap] - https://gerrit.wikimedia.org/r/860508
[09:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321126)', diff saved to https://phabricator.wikimedia.org/P40960 and previous config saved to /var/cache/conftool/dbconfig/20221124-092742-marostegui.json
[09:27:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[09:27:49] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[09:27:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[09:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40961 and previous config saved to /var/cache/conftool/dbconfig/20221124-092804-marostegui.json
[09:28:24] (PS2) Giuseppe Lavagetto: datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489
[09:28:32] (PS2) Giuseppe Lavagetto: developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490
[09:29:11] (CR) CI reject: [V: -1] datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489 (owner: Giuseppe Lavagetto)
[09:29:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40962 and previous config saved to /var/cache/conftool/dbconfig/20221124-092912-marostegui.json
[09:29:17] (CR) CI reject: [V: -1] developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490 (owner: Giuseppe Lavagetto)
[09:30:24] (PS2) Jaime Nuche: k8s builder: allow deployers to sudo update-mediawiki-tools-release [puppet] - https://gerrit.wikimedia.org/r/860121
(https://phabricator.wikimedia.org/T323735) (owner: Brennen Bearnes)
[09:33:33] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts graphite2003.codfw.wmnet
[09:35:35] (CR) Jaime Nuche: "Thanks Brennen. I've moved the sudo privilege to the K8s builder profile, which is where we were already granting some of these permission" [puppet] - https://gerrit.wikimedia.org/r/860121 (https://phabricator.wikimedia.org/T323735) (owner: Brennen Bearnes)
[09:38:17] !log filippo@cumin1001 START - Cookbook sre.dns.netbox
[09:40:25] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: graphite2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[09:41:11] (PS3) Giuseppe Lavagetto: datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489
[09:41:13] (PS3) Giuseppe Lavagetto: developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490
[09:41:54] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: graphite2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[09:41:54] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:41:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts graphite2003.codfw.wmnet
[09:42:10] (CR) Volans: [C: -1] "Nice addition!
Some issues, questions and comments inline" [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:42:31] ops-codfw, decommission-hardware, User-fgiunchedi: decommission graphite2003.codfw.wmnet - https://phabricator.wikimedia.org/T323718 (fgiunchedi)
[09:42:48] ops-codfw, decommission-hardware, User-fgiunchedi: decommission graphite2003.codfw.wmnet - https://phabricator.wikimedia.org/T323718 (fgiunchedi) @Papaul host is ready for decom
[09:44:04] (PS1) Giuseppe Lavagetto: mathoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860509
[09:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P40963 and previous config saved to /var/cache/conftool/dbconfig/20221124-094418-marostegui.json
[09:44:52] (PS1) Giuseppe Lavagetto: miscweb: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860510
[09:45:37] (PS2) Slyngshede: Allow configuration from json file.
[software/bitu-ldap] - https://gerrit.wikimedia.org/r/860508
[09:46:44] (PS1) Giuseppe Lavagetto: recommendation-api: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860511
[09:47:25] (PS1) Giuseppe Lavagetto: eventstreams: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860512
[09:48:06] (PS1) Giuseppe Lavagetto: wikifeeds: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860513
[09:48:38] (CR) Jaime Nuche: "PCC compilation looks healthy: https://puppet-compiler.wmflabs.org/output/860121/38418/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/860121 (https://phabricator.wikimedia.org/T323735) (owner: Brennen Bearnes)
[09:50:39] (CR) Giuseppe Lavagetto: [C: +2] k8s builder: allow deployers to sudo update-mediawiki-tools-release [puppet] - https://gerrit.wikimedia.org/r/860121 (https://phabricator.wikimedia.org/T323735) (owner: Brennen Bearnes)
[09:52:45] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (BTullis) a:BTullis
[09:54:46] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox
[09:57:51] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removed AAAA entry for clouddb1013 - dcaro@cumin1001"
[09:58:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[09:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[09:59:11] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removed AAAA entry for clouddb1013 - dcaro@cumin1001"
[09:59:11] !log dcaro@cumin1001 END
(PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P40964 and previous config saved to /var/cache/conftool/dbconfig/20221124-095925-marostegui.json
[10:03:37] (PS1) Giuseppe Lavagetto: wikifeeds: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860515
[10:04:36] (PS2) Giuseppe Lavagetto: kask: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860515
[10:05:41] (PS1) Giuseppe Lavagetto: changeprop: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860517
[10:06:00] (PS1) Giuseppe Lavagetto: eventgate: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860518
[10:07:14] (CR) Muehlenhoff: Fix typing to allow Python 3.7 support. (1 comment) [software/bitu-ldap] - https://gerrit.wikimedia.org/r/858457 (owner: Slyngshede)
[10:07:56] (PS1) Giuseppe Lavagetto: linkrecommendation: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860519
[10:08:14] (PS1) Giuseppe Lavagetto: mobileapps: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860520
[10:09:11] (PS1) Filippo Giunchedi: graphite: mirror traffic to graphite1005 [puppet] - https://gerrit.wikimedia.org/r/860521 (https://phabricator.wikimedia.org/T318903)
[10:09:13] (PS1) Filippo Giunchedi: hieradata: pool graphite1005 for reads [puppet] - https://gerrit.wikimedia.org/r/860522 (https://phabricator.wikimedia.org/T318903)
[10:14:27] (CR) Muehlenhoff: Allow configuration from json file.
(2 comments) [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860508 (owner: Slyngshede)
[10:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40965 and previous config saved to /var/cache/conftool/dbconfig/20221124-101431-marostegui.json
[10:14:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[10:14:38] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[10:14:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[10:14:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T321126)', diff saved to https://phabricator.wikimedia.org/P40966 and previous config saved to /var/cache/conftool/dbconfig/20221124-101452-marostegui.json
[10:16:43] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox
[10:16:53] (CR) Giuseppe Lavagetto: [C: +2] datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489 (owner: Giuseppe Lavagetto)
[10:17:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321126)', diff saved to https://phabricator.wikimedia.org/P40967 and previous config saved to /var/cache/conftool/dbconfig/20221124-101701-marostegui.json
[10:17:55] (PS1) Btullis: Add dcausse and gmodena to analytics-admins [puppet] - https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280)
[10:19:13] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used.
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[10:19:24] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removed AAAA entry for all clouddbs - dcaro@cumin1001"
[10:20:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removed AAAA entry for all clouddbs - dcaro@cumin1001"
[10:20:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:21:27] (Merged) jenkins-bot: datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489 (owner: Giuseppe Lavagetto)
[10:23:49] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:23:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[10:24:16] (PS2) Btullis: Add dcausse and gmodena to analytics-admins [puppet] - https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280)
[10:25:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:26:05] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:27:16] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38419/console" [puppet] - https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280) (owner: Btullis)
[10:29:18] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38420/console" [puppet] - https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280) (owner: Btullis)
[10:32:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114',
diff saved to https://phabricator.wikimedia.org/P40968 and previous config saved to /var/cache/conftool/dbconfig/20221124-103207-marostegui.json
[10:32:10] (CR) Giuseppe Lavagetto: [C: +2] developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490 (owner: Giuseppe Lavagetto)
[10:32:47] (PS1) Filippo Giunchedi: dcops: switch mgmt down alerts to open tasks [alerts] - https://gerrit.wikimedia.org/r/860525 (https://phabricator.wikimedia.org/T310266)
[10:33:00] (CR) Muehlenhoff: [C: +2] hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:37:05] (Merged) jenkins-bot: developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490 (owner: Giuseppe Lavagetto)
[10:41:30] (CR) Filippo Giunchedi: [C: +1] spicerack: add monitoring for sre.puppet.netbox-sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[10:41:54] !log reboot rdb1010, rdb1012, rdb2008, rdb2010 for kernel upgrades. All are redis replicas, there should be no impact.
[10:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:33] PROBLEM - Host rdb2010 is DOWN: PING CRITICAL - Packet loss = 100%
[10:44:53] RECOVERY - Host rdb2010 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[10:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P40969 and previous config saved to /var/cache/conftool/dbconfig/20221124-104714-marostegui.json
[10:51:33] (PS1) Giuseppe Lavagetto: datahub: convert subcharts to modules too [deployment-charts] - https://gerrit.wikimedia.org/r/860530
[10:51:39] (CR) Alexandros Kosiaris: "Minor question, but overall LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: Clément Goubert)
[10:52:53] (CR) CI reject: [V: -1] datahub: convert subcharts to modules too [deployment-charts] - https://gerrit.wikimedia.org/r/860530 (owner: Giuseppe Lavagetto)
[10:57:09] (PS2) Giuseppe Lavagetto: datahub: convert subcharts to modules too [deployment-charts] - https://gerrit.wikimedia.org/r/860530
[10:57:13] (CR) Jbond: [C: +1] dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn)
[11:00:04] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1100).
[11:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321126)', diff saved to https://phabricator.wikimedia.org/P40970 and previous config saved to /var/cache/conftool/dbconfig/20221124-110220-marostegui.json [11:02:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:02:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:02:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:02:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:02:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280) (owner: 10Btullis) [11:02:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T321126)', diff saved to https://phabricator.wikimedia.org/P40971 and previous config saved to /var/cache/conftool/dbconfig/20221124-110258-marostegui.json [11:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321126)', diff saved to https://phabricator.wikimedia.org/P40972 and previous config saved to /var/cache/conftool/dbconfig/20221124-110405-marostegui.json [11:05:00] (03PS1) 10Jbond: Revert "pki: move root common settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/860488 [11:05:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] datahub: convert subcharts to modules too [deployment-charts] - 10https://gerrit.wikimedia.org/r/860530 (owner: 10Giuseppe Lavagetto) [11:05:57] (03CR) 10CI reject: [V: 04-1] Revert "pki: move root common settings 
to profile" [puppet] - 10https://gerrit.wikimedia.org/r/860488 (owner: 10Jbond) [11:06:20] (03CR) 10Btullis: [V: 03+1] "For future reference: the linked ticket was created by @ottomata and so his approval was inferred from this fact. https://phabricator.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280) (owner: 10Btullis) [11:06:26] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add dcausse and gmodena to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/860523 (https://phabricator.wikimedia.org/T323280) (owner: 10Btullis) [11:07:25] (03PS2) 10Jbond: Revert "pki: move root common settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/860488 [11:07:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "pki: move root common settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/860488 (owner: 10Jbond) [11:10:12] (03Merged) 10jenkins-bot: datahub: convert subcharts to modules too [deployment-charts] - 10https://gerrit.wikimedia.org/r/860530 (owner: 10Giuseppe Lavagetto) [11:10:40] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: update documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/860074 (owner: 10Jbond) [11:10:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond) [11:11:04] (03PS4) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 [11:14:06] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10BTullis) 05Open→03Resolved @dcausse, @gmodena - Welcome to the `analytics-admins` group! Please take suitable care with your... 
[11:16:19] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:18:09] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [11:18:15] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [11:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P40973 and previous config saved to /var/cache/conftool/dbconfig/20221124-111912-marostegui.json [11:22:36] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:25:37] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:28:50] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:28:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:31:18] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [11:31:25] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[11:31:39] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [11:31:44] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [11:33:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:34:11] (03PS1) 10Giuseppe Lavagetto: datahub: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/860534 [11:34:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P40974 and previous config saved to /var/cache/conftool/dbconfig/20221124-113418-marostegui.json [11:34:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] datahub: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/860534 (owner: 10Giuseppe Lavagetto) [11:39:13] (03Merged) 10jenkins-bot: datahub: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/860534 (owner: 10Giuseppe Lavagetto) [11:39:33] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:40:04] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:43:06] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [11:43:12] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by jbond@cum... [11:44:37] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:45:38] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:46:35] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:48:12] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321126)', diff saved to https://phabricator.wikimedia.org/P40976 and previous config saved to /var/cache/conftool/dbconfig/20221124-114925-marostegui.json [11:49:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:49:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:49:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:49:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:49:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:50:02] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T321126)', diff saved to https://phabricator.wikimedia.org/P40977 and previous config saved to /var/cache/conftool/dbconfig/20221124-115004-marostegui.json [11:51:04] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:52:41] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1044: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860535 (https://phabricator.wikimedia.org/T319184) [11:52:44] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:56:01] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) >>! In T308677#8417608, @MoritzMuehlenhoff wrote: >>>! In T308677#8346... [11:57:58] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1044: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860535 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:58:15] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) > This seems to be a more generic issue with partman creating the sow... 
[11:59:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bullseye [11:59:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1044.eqiad.wmnet with O... [11:59:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1044: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860535 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:05:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321126)', diff saved to https://phabricator.wikimedia.org/P40978 and previous config saved to /var/cache/conftool/dbconfig/20221124-120514-marostegui.json [12:05:21] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:07:57] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [12:08:04] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[12:08:35] * jbond overly optimistic this is the one that will work [12:09:13] * volans crossing fingers ;) [12:12:24] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [12:12:48] * jbond thanks vol.ans [12:13:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:15:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [12:17:34] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [12:17:40] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [12:18:20] * jbond :( naughty d-i why have you decided you no longer have drivers for the controller :@ ! [12:18:27] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [12:18:33] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [12:18:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:13] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [12:20:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40979 and previous config saved to /var/cache/conftool/dbconfig/20221124-122020-marostegui.json [12:22:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on idp-test1002.wikimedia.org with reason: Testing some changes, service will be down from time to time [12:23:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on idp-test1002.wikimedia.org with reason: Testing some changes, service will be down from time to time [12:24:04] (03PS3) 10Slyngshede: Allow configuration from json file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 [12:24:13] (03CR) 10Slyngshede: Allow configuration from json file. (032 comments) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 (owner: 10Slyngshede) [12:25:16] (03PS4) 10Slyngshede: Allow configuration from json file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 [12:30:55] (03PS5) 10Slyngshede: Allow configuration from json file. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 [12:30:59] (03PS2) 10Volans: tox.ini: drop support for python3.7/3.8 [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 (owner: 10Jbond) [12:32:25] (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 (owner: 10Jbond) [12:33:46] (03CR) 10Muehlenhoff: Enable profile::auto_restarts::service for virtlogd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:35:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40980 and previous config saved to /var/cache/conftool/dbconfig/20221124-123527-marostegui.json [12:37:22] (03PS2) 10Slyngshede: Fix typing to allow Python 3.7 support. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858457 [12:37:52] (03CR) 10Slyngshede: Fix typing to allow Python 3.7 support. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858457 (owner: 10Slyngshede) [12:38:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bullseye [12:38:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bu... 
[12:42:10] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [12:42:16] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [12:42:50] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [12:42:56] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [12:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858457 (owner: 10Slyngshede) [12:47:59] (03CR) 10Slyngshede: [V: 03+2] Fix typing to allow Python 3.7 support. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858457 (owner: 10Slyngshede) [12:48:01] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Fix typing to allow Python 3.7 support. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858457 (owner: 10Slyngshede) [12:48:31] (03PS6) 10Slyngshede: Allow configuration from json file. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 [12:50:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321126)', diff saved to https://phabricator.wikimedia.org/P40981 and previous config saved to /var/cache/conftool/dbconfig/20221124-125033-marostegui.json [12:50:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:50:41] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:50:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:50:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:51:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:51:07] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/860517 (owner: 10Giuseppe Lavagetto) [12:51:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T321126)', diff saved to https://phabricator.wikimedia.org/P40982 and previous config saved to /var/cache/conftool/dbconfig/20221124-125111-marostegui.json [12:52:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321126)', diff saved to https://phabricator.wikimedia.org/P40983 and previous config saved to /var/cache/conftool/dbconfig/20221124-125218-marostegui.json [12:56:40] (03PS11) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [12:57:01] (03PS1) 10Muehlenhoff: Migrate service definitions to CasRegisteredService [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) 
[12:57:28] (03PS2) 10Jbond: install_server: fix config for ms-be dynamic partition [puppet] - 10https://gerrit.wikimedia.org/r/860114 [13:01:14] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [13:02:48] !log installing glibc security updates on buster [13:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] (03PS4) 10Stang: zhwiki: Revert 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859) [13:04:18] jouncebot: next [13:04:18] In 0 hour(s) and 55 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1400) [13:04:18] In 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1400) [13:04:45] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [13:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P40984 and previous config saved to /var/cache/conftool/dbconfig/20221124-130725-marostegui.json [13:07:29] (03CR) 10David Caro: [C: 03+1] Enable profile::auto_restarts::service for virtlogd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:09:08] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [13:09:35] Lucas_WMDE, urbanecm: I see there's a couple of patches in the upcoming backport window, backports yesterday were affected by https://phabricator.wikimedia.org/T323735 [13:09:54] the problem should be fixed now, but I'll be around in case it happens again [13:10:16] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [13:11:41] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [13:12:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [13:13:48] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [13:16:03] (03PS1) 10Muehlenhoff: sre.misc-clusters.roll-restart-reboot-eventschemas: Also restart envoyproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/860556 [13:16:58] ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23 [13:16:58] ng [13:18:16] (03PS1) 10Arturo Borrero Gonzalez: wmcs: proxy: resolve home directory in the puppet ca path [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860557 [13:18:47] (03CR) 10Jbond: "lgtm, will ping the task when cloud is updated. 
We could possibly also use the cas_version fact if that ends up taking longer for some re" [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [13:20:31] (03CR) 10David Caro: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860557 (owner: 10Arturo Borrero Gonzalez) [13:21:16] (03CR) 10Muehlenhoff: Migrate service definitions to CasRegisteredService (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [13:22:18] (03CR) 10FNegri: [C: 03+1] wmcs: proxy: resolve home directory in the puppet ca path [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860557 (owner: 10Arturo Borrero Gonzalez) [13:22:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P40985 and previous config saved to /var/cache/conftool/dbconfig/20221124-132231-marostegui.json [13:22:35] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2050.codfw.wmnet with OS bullseye [13:22:41] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[13:22:46] (03CR) 10David Caro: [C: 03+2] wmcs: proxy: resolve home directory in the puppet ca path [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860557 (owner: 10Arturo Borrero Gonzalez) [13:22:50] (03PS5) 10Stang: zhwiki: Revert 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859) [13:26:25] (03Merged) 10jenkins-bot: wmcs: proxy: resolve home directory in the puppet ca path [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860557 (owner: 10Arturo Borrero Gonzalez) [13:28:33] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Envoy on planet [puppet] - 10https://gerrit.wikimedia.org/r/860560 (https://phabricator.wikimedia.org/T135991) [13:30:20] !log restarting slapd on serpens/seaborgium [13:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [13:30:54] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[13:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321126)', diff saved to https://phabricator.wikimedia.org/P40986 and previous config saved to /var/cache/conftool/dbconfig/20221124-133738-marostegui.json [13:37:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:37:44] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [13:37:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:37:51] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [13:37:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:38:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T321126)', diff saved to https://phabricator.wikimedia.org/P40987 and previous config saved to /var/cache/conftool/dbconfig/20221124-133759-marostegui.json [13:38:43] !log btullis@cumin1001 Added views for new wiki: igwiktionary T314645 [13:38:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [13:38:49] T314645: Prepare and check storage layer for igwiktionary - https://phabricator.wikimedia.org/T314645 [13:39:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321126)', diff saved to https://phabricator.wikimedia.org/P40988 and previous config saved to /var/cache/conftool/dbconfig/20221124-133907-marostegui.json [13:39:22] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for 
virtlogd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:43:45] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [13:43:48] (03PS1) 10Arturo Borrero Gonzalez: wmcs: proxy: use port [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860563 [13:43:52] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [13:52:41] (03PS1) 10Muehlenhoff: Add a cookbook to restart/reboot ncredir nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 [13:53:51] !log Removed unused and expiring kafka_jumbo certificates. T323697 [13:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:58] T323697: Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 [13:54:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P40989 and previous config saved to /var/cache/conftool/dbconfig/20221124-135413-marostegui.json [13:55:23] (03PS2) 10Filippo Giunchedi: dcops: switch mgmt down alerts to open tasks [alerts] - 10https://gerrit.wikimedia.org/r/860525 (https://phabricator.wikimedia.org/T310266) [13:59:19] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [13:59:45] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1400). [14:00:05] cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:39] o/ [14:03:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] changeprop: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860517 (owner: 10Giuseppe Lavagetto) [14:05:15] (03CR) 10Clément Goubert: Add a new production image for otelcol (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [14:06:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kask: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860515 (owner: 10Giuseppe Lavagetto) [14:08:09] (03Merged) 10jenkins-bot: changeprop: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860517 (owner: 10Giuseppe Lavagetto) [14:09:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P40990 and previous config saved to /var/cache/conftool/dbconfig/20221124-140920-marostegui.json [14:10:11] (03CR) 10Jbond: [C: 03+2] install_server: fix config for ms-be dynamic partition [puppet] - 10https://gerrit.wikimedia.org/r/860114 (owner: 10Jbond) [14:10:33] (03Merged) 10jenkins-bot: kask: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860515 (owner: 10Giuseppe Lavagetto) [14:10:45] moritzm: ok to merge yours [14:11:17] sorry, yes please go ahead [14:11:31] done [14:11:53] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [14:11:59] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing 
installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [14:13:14] o/ [14:13:23] looks like nobody’s doing the backport window yet? [14:13:32] in which case I can deploy [14:13:46] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [14:13:53] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [14:14:06] cirno: new nickname? ^^ [14:14:21] :P [14:15:10] (03PS1) 10Slyngshede: WIP C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [14:15:56] (03CR) 10Giuseppe Lavagetto: "Mostly a doubt about the chosen UID. Otherwise lgtm." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [14:15:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/860556 (owner: 10Muehlenhoff) [14:16:26] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus-ipmi-exporter [puppet] - 10https://gerrit.wikimedia.org/r/860569 (https://phabricator.wikimedia.org/T135991) [14:17:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860117 (https://phabricator.wikimedia.org/T323734) (owner: 10Stang) [14:18:11] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:18:50] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:19:12] (Merged) jenkins-bot: wikidatawiki: Add language-specific logos [mediawiki-config] - https://gerrit.wikimedia.org/r/860117 (https://phabricator.wikimedia.org/T323734) (owner: Stang) [14:19:32] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:860117|wikidatawiki: Add language-specific logos (T323734)]] [14:19:38] T323734: Move language-specific logos from Commons.css to logos.php at wikidatawiki - https://phabricator.wikimedia.org/T323734 [14:20:52] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:860117|wikidatawiki: Add language-specific logos (T323734)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:20:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:21:17] cirno: please test [14:21:30] looking [14:21:32] for my part, I see an Arabic logo on https://www.wikidata.org/wiki/Wikidata:Main_Page?uselang=ar&safemode=1 now (safemode to bypass common.css) that wasn’t there before, so that part seems to be working [14:22:30] I tested all 9 sites mentioned in this patch, all of them look fine to me [14:22:40] hm, the English logo gets smaller on my end [14:22:52] (CR) Giuseppe Lavagetto: Add a new production image for otelcol (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: Clément Goubert) [14:23:21] left is mwdebug, right without
https://usercontent.irccloud-cdn.com/file/KRyOK27P/image.png [14:23:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:24:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321126)', diff saved to https://phabricator.wikimedia.org/P40991 and previous config saved to /var/cache/conftool/dbconfig/20221124-142426-marostegui.json [14:24:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance [14:24:33] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:24:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:24:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:24:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance [14:24:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40992 and previous config saved to /var/cache/conftool/dbconfig/20221124-142447-marostegui.json [14:24:55] it is smaller, the old logo's width is larger than 135px [14:25:10] (03CR) 10Giuseppe Lavagetto: "Couple details, LGTM otherwise." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [14:25:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:26:08] yeah [14:26:28] and nothing in the task or commit message said that making the logo smaller was supposed to be part of it :/ [14:27:35] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:27:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40993 and previous config saved to /var/cache/conftool/dbconfig/20221124-142756-marostegui.json [14:28:31] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:28:52] strange, from my side this does not change the appearance, (the right for mwdebug1001 https://usercontent.irccloud-cdn.com/file/cuIEoPto/image.png [14:29:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860508 (owner: 10Slyngshede) [14:29:05] cirno: did you force-reload as well? [14:29:06] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:29:25] yeah, I pressed Ctrl+Shift+R [14:29:38] (PS1) Filippo Giunchedi: icinga: decom mgmt monitoring [puppet] - https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) [14:29:40] (PS1) Filippo Giunchedi: icinga: move mgmt_parents to icinga [puppet] - https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) [14:29:42] (PS1) Filippo Giunchedi: hieradata: remove mgmt_contactgroups [puppet] - https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) [14:29:56] ah, I think you’re at 150% zoom or something? [14:30:03] at 150% the difference vanishes on my end too [14:30:24] same for 200% [14:30:28] it only exists on 100%, it seems [14:30:29] I'm at 200% zoom [14:30:57] s/it only exists/the difference only exists/ [14:31:44] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [14:31:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:32:37] I’ll sync the change now [14:32:39] (CR) CI reject: [V: -1] icinga: decom mgmt monitoring [puppet] - https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [14:33:17] (CR) CI reject: [V: -1] icinga: move mgmt_parents to icinga [puppet] - https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [14:33:29] I see, if I uncheck the `background-size` rule at 150%, it does become bigger [14:33:38] https://usercontent.irccloud-cdn.com/file/PC0BK2dv/image.png [14:34:09] thanks [14:34:17] I don’t see that rule, strange [14:34:21] ah [14:34:25] webkit-min-device-pixel-ratio
[14:34:33] apple hidpi screen shenanigans I guess [14:35:13] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [14:35:18] oh wait, now I see the rule you mean [14:35:21] at 150% zoom still [14:35:22] yeah [14:36:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:36:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:860117|wikidatawiki: Add language-specific logos (T323734)]] (duration: 17m 24s) [14:37:02] (03PS2) 10Filippo Giunchedi: icinga: decom mgmt monitoring [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) [14:37:02] T323734: Move language-specific logos from Commons.css to logos.php at wikidatawiki - https://phabricator.wikimedia.org/T323734 [14:37:04] (03PS2) 10Filippo Giunchedi: icinga: move mgmt_parents to icinga [puppet] - 10https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) [14:37:06] (03PS2) 10Filippo Giunchedi: hieradata: remove mgmt_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) [14:37:44] (03CR) 10David Caro: [C: 03+1] "LGTM, does this work for you? 
(looks like it should)" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860563 (owner: Arturo Borrero Gonzalez) [14:37:55] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/wikidatawiki%s.png\n' '' '-1.5x' '-2x' | mwscript purgeList.php # T323734 [14:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:40:16] cirno: in the zhwiki change, is it intentional that the non-20y SVGs also change? [14:40:34] I looked at wikipedia-tagline-zh-hans.svg and it even seems to change script [14:42:17] I just re-downloaded those files from commons and compressed them, it should not change anything [14:43:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P40994 and previous config saved to /var/cache/conftool/dbconfig/20221124-144303-marostegui.json [14:43:13] if I look at `static/images/mobile/copyright/wikipedia-tagline-zh-hans.svg` on master, it’s definitely in some Chinese script (I can’t say if simplified or traditional) [14:43:21] oh wait [14:43:24] no, sorry [14:43:27] the script doesn’t change [14:43:34] eog just silently moves ahead to the next file??
[14:43:39] so I was looking at `zh_min_nan.svg` now [14:43:44] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for envoyproxy on Grafana [puppet] - 10https://gerrit.wikimedia.org/r/860576 (https://phabricator.wikimedia.org/T135991) [14:43:51] let me look again [14:43:59] 0_o [14:44:47] okay, they are identical [14:44:50] eog just confused me [14:44:55] (gnome’s image viewer… “eye of gnome”) [14:45:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mathoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860509 (owner: 10Giuseppe Lavagetto) [14:45:19] (I use a vscode plugin to preview those files [14:45:51] (03PS6) 10Lucas Werkmeister (WMDE): zhwiki: Revert 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [14:46:39] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:47:01] (03CR) 10Vgutierrez: Add a cookbook to restart/reboot ncredir nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/860564 (owner: 10Muehlenhoff) [14:47:23] (I actually wanted to open both copies in eog, but after I opened eog on master and downloaded the change, I noticed that eog now showed something else) [14:47:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] linkrecommendation: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860519 (owner: 10Giuseppe Lavagetto) [14:47:45] (so i just assumed that it was now showing the new file content, when it actually showed a different file – perhaps because Git momentarily removed the old file before creating the new version) [14:48:41] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:49:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:50:45] !log updating package otelcol-contrib to 0.66.0 in component thirdparty/otelcol-contrib [14:50:48] wow zuul is very busy at the moment [14:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] cirno: fyi I’m waiting for the test build to pass before I +2 the zh change [14:51:05] yeah, queued for 4 minutes... [14:51:46] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:52:17] huge depends-on chain at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/860125 apparently [14:52:24] (and all that for a change that’s DO NOT MERGE) [14:52:24] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:52:28] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:52:31] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2050.codfw.wmnet with OS bullseye [14:52:34] ok now it’s running [14:52:45] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[14:53:20] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:53:28] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:53:40] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:54:47] <_joe_> jelto: ^^ [14:54:53] <_joe_> seems like zuul is stuck [14:54:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:55:41] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Envoy on debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/860579 (https://phabricator.wikimedia.org/T135991) [14:55:47] <_joe_> hashar: you as well :) [14:55:48] (03Merged) 10jenkins-bot: mathoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860509 (owner: 10Giuseppe Lavagetto) [14:56:07] completely stuck or just very full? [14:56:07] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for envoyproxy on Grafana [puppet] - 10https://gerrit.wikimedia.org/r/860576 (https://phabricator.wikimedia.org/T135991) [14:56:10] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[14:56:14] I just saw a patch leave gate-and-submit fwiw [14:56:27] (CR) Lucas Werkmeister (WMDE): [C: +2] zhwiki: Revert 20 years logos [mediawiki-config] - https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859) (owner: Stang) [14:56:36] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:56:37] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:57:26] <_joe_> Lucas_WMDE: it's very slow [14:57:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:57:37] <_joe_> and I think some queues are actually stuck [14:57:42] ok [14:57:47] <_joe_> but I don't want to start debugging zuul tbh :P [14:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P40995 and previous config saved to /var/cache/conftool/dbconfig/20221124-145810-marostegui.json [14:58:19] !log rebalance Ganeti group C/eqiad T311687 [14:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:24] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:58:25] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:58:29] (Merged) jenkins-bot: zhwiki: Revert 20 years logos [mediawiki-config] - https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859) (owner: Stang) [14:59:04] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:59:26] oh damn, I +2ed instead of using scap backport [14:59:27] !log oblivian@deploy1002 helmfile [staging] 
START helmfile.d/services/mathoid: apply [14:59:30] oh well, let’s do it manually [14:59:43] (Merged) jenkins-bot: linkrecommendation: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860519 (owner: Giuseppe Lavagetto) [14:59:47] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [15:00:02] cirno: the change is on mwdebug1001 (and only there), can you test it? [15:00:23] (can’t be bothered to SSH into the other three mwdebug servers where `scap backport` would also have deployed the change for testing) [15:00:32] looking [15:00:35] (PS1) Hnowlan: Add tinyrgb colour profile [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/860580 (https://phabricator.wikimedia.org/T233196) [15:00:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:01:07] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [15:01:19] (PS7) Clément Goubert: Add a new production image for otelcol [docker-images/production-images] - https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) [15:01:21] (CR) Filippo Giunchedi: [C: +1] Enable profile::auto_restarts::service for envoyproxy on Grafana [puppet] - https://gerrit.wikimedia.org/r/860576 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [15:01:35] Lucas_WMDE: tested on legacy vector and vector-2022, both look fine to me [15:01:48] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [15:01:49] great, thanks [15:02:26] oh damn and the window is already over [15:02:27] jouncebot: now [15:02:27] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [15:02:28] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:29] phew [15:02:36] those three syncs will take a bit [15:02:54] I’m syncing config.yaml first (which I suspect is a prod noop), then logos.php, then static/ [15:03:13] so that the old logos.php doesn’t reference files that are already being deleted [15:03:21] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:03:53] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:03:55] (03PS8) 10Clément Goubert: Add a new production image for otelcol [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) [15:04:06] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [15:04:28] (03CR) 10Clément Goubert: Add a new production image for otelcol (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [15:04:34] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [15:05:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:06:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:06:52] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/mw-debug: apply [15:06:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:07:02] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:858709|zhwiki: Revert 20 years logos (T320859)]] (1/3) (duration: 04m 41s) [15:07:07] T320859: Requesting temporary logo change for zh.wikipedia.org - https://phabricator.wikimedia.org/T320859 [15:07:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:09:03] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:48] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:858709|zhwiki: Revert 20 years logos (T320859)]] (2/3) (duration: 04m 34s) [15:12:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:13:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:13:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:13:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40996 and previous config saved to /var/cache/conftool/dbconfig/20221124-151316-marostegui.json [15:13:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1192.eqiad.wmnet with reason: Maintenance [15:13:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:13:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1192.eqiad.wmnet with reason: Maintenance [15:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T321126)', diff saved to 
https://phabricator.wikimedia.org/P40997 and previous config saved to /var/cache/conftool/dbconfig/20221124-151338-marostegui.json [15:13:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:14:43] (03PS1) 10Jbond: install_server: migrate ms-bs_simple top GPT [puppet] - 10https://gerrit.wikimedia.org/r/860581 (https://phabricator.wikimedia.org/T308677) [15:14:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321126)', diff saved to https://phabricator.wikimedia.org/P40998 and previous config saved to /var/cache/conftool/dbconfig/20221124-151445-marostegui.json [15:16:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/: Config: [[gerrit:858709|zhwiki: Revert 20 years logos (T320859)]] (3/3) (duration: 04m 43s) [15:17:00] T320859: Requesting temporary logo change for zh.wikipedia.org - https://phabricator.wikimedia.org/T320859 [15:17:18] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-%s.svg\n' {tagline-zh{,-hans},wordmark-zh-hans} | mwscript purgeList.php # T320859 [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:37] cirno: can you quickly check that things aren’t horribly broken without mwdebug now? 
^^ [15:18:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: proxy: use port [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860563 (owner: 10Arturo Borrero Gonzalez) [15:18:31] Lucas_WMDE: it works fine from my side [15:18:39] (03PS1) 10TK-999: mcrouter: Specify missing CXXFLAGS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 [15:19:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:19:08] yay thanks [15:19:13] !log UTC afternoon backport+config window done [15:19:17] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:19:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:20:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:21:31] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [15:24:55] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [15:25:02] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [15:25:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [15:25:54] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [15:29:15] (03PS6) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [15:29:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P40999 and previous config saved to /var/cache/conftool/dbconfig/20221124-152952-marostegui.json [15:30:32] !log Started deployment of refinery as part of weekly deployment train [15:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [15:32:14] !log ebysans@deploy1002 Started deploy [analytics/refinery@1bfb89f]: Regular analytics weekly train [analytics/refinery@1bfb89f] [15:32:28] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) ok with T308677#8419843 and T308677#8420119 i have now managed to succ... 
[15:32:48] Emperor: FYI ^^^ [15:41:13] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 [15:41:20] !log ebysans@deploy1002 Finished deploy [analytics/refinery@1bfb89f]: Regular analytics weekly train [analytics/refinery@1bfb89f] (duration: 09m 06s) [15:42:41] !log ebysans@deploy1002 Started deploy [analytics/refinery@1bfb89f] (thin): Regular analytics weekly train THIN [analytics/refinery@1bfb89f] [15:42:45] (03CR) 10CI reject: [V: 04-1] Set profile::contacts::role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 (owner: 10Muehlenhoff) [15:42:48] !log ebysans@deploy1002 Finished deploy [analytics/refinery@1bfb89f] (thin): Regular analytics weekly train THIN [analytics/refinery@1bfb89f] (duration: 00m 07s) [15:43:15] !log ebysans@deploy1002 Started deploy [analytics/refinery@1bfb89f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1bfb89f] [15:44:15] (03PS2) 10Muehlenhoff: Set role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 [15:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P41000 and previous config saved to /var/cache/conftool/dbconfig/20221124-154458-marostegui.json [15:45:16] !log ebysans@deploy1002 Finished deploy [analytics/refinery@1bfb89f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1bfb89f] (duration: 02m 00s) [15:45:53] (03PS1) 10Filippo Giunchedi: o11y: more lenient logstash kafka consumer lag [alerts] - 10https://gerrit.wikimedia.org/r/860609 [15:49:26] (03CR) 10Filippo Giunchedi: "Thank you Andrea! 
I'll let you merge as needed" [puppet] - 10https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: 10Filippo Giunchedi) [15:49:30] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: neutron: mark several commands as safe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860610 [15:50:31] (03PS1) 10Ssingh: Release 0.35-2 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/860612 (https://phabricator.wikimedia.org/T321309) [15:50:46] (03PS2) 10Arturo Borrero Gonzalez: wmcs: openstack: neutron: mark several commands as safe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860610 [15:54:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:56:05] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:58:55] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:00] (03CR) 10Jcrespo: [C: 03+1] admin: add dpujol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: 10Filippo Giunchedi) [16:00:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321126)', diff saved to https://phabricator.wikimedia.org/P41001 and previous config saved to /var/cache/conftool/dbconfig/20221124-160005-marostegui.json [16:00:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1193.eqiad.wmnet with reason: Maintenance [16:00:12] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:00:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1193.eqiad.wmnet with reason: Maintenance [16:00:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T321126)', diff saved to https://phabricator.wikimedia.org/P41002 and previous config saved to /var/cache/conftool/dbconfig/20221124-160026-marostegui.json [16:02:02] (03CR) 10David Caro: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860610 (owner: 10Arturo Borrero Gonzalez) [16:02:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321126)', diff saved to https://phabricator.wikimedia.org/P41003 and previous config saved to /var/cache/conftool/dbconfig/20221124-160234-marostegui.json [16:08:25] _joe_: sorry I missed 
your ping about zuul. [16:08:38] looks like it had a spam of changes https://grafana.wikimedia.org/d/000000322/zuul-gearman?viewPanel=10&from=now-24h&to=now&orgId=1 [16:08:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: neutron: mark several commands as safe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860610 (owner: 10Arturo Borrero Gonzalez) [16:09:11] it will process them eventually [16:09:55] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:10:09] (03CR) 10Vgutierrez: [C: 04-1] "we need to bump dependencies in setup.py first" [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/860612 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:11:49] !log killed webrequest-druid-hourly-coord for restart as part of weekly deployment train [16:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:16] !log successfully restarted webrequest-druid-hourly-coord for restart as part of weekly deployment train. [16:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:32] !log killed webrequest-druid-daily-coord for restart as part of weekly deployment train. [16:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:48] (03CR) 10MVernon: [C: 03+1] "This looks good to me now, thank you so much for doing this!"
[puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:17:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P41004 and previous config saved to /var/cache/conftool/dbconfig/20221124-161741-marostegui.json [16:18:19] (03PS12) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [16:22:17] !log successfully restarted webrequest-druid-daily-coord as part of weekly deployment train. [16:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:01] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (10hnowlan) [16:29:22] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) ms-be2050 looks good to me now, thank you :) I think any appr... 
[16:29:36] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: common: allow arbitrary flavor names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 [16:29:59] (03CR) 10Daniel Kinzler: api-gateway: expose restbase /api/ endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [16:30:33] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (10hnowlan) 05Open→03In progress [16:30:37] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [16:32:04] (03PS3) 10David Caro: toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [16:32:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P41006 and previous config saved to /var/cache/conftool/dbconfig/20221124-163247-marostegui.json [16:34:37] (03PS2) 10Arturo Borrero Gonzalez: wmcs: openstack: common: allow arbitrary flavor names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 [16:41:44] (03PS13) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [16:44:55] (03CR) 10Vgutierrez: [C: 04-1] toolforge harbor: update certs with acmechief (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [16:45:51] (03CR) 10David Caro: "I would avoid giving this option until is actually needed. 
That helps avoid creating flavors with custom names unless it's strictly necess" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:47:53] (03PS1) 10David Caro: harbor: remove support for !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321126)', diff saved to https://phabricator.wikimedia.org/P41008 and previous config saved to /var/cache/conftool/dbconfig/20221124-164754-marostegui.json [16:47:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1203.eqiad.wmnet with reason: Maintenance [16:48:01] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:48:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1203.eqiad.wmnet with reason: Maintenance [16:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T321126)', diff saved to https://phabricator.wikimedia.org/P41009 and previous config saved to /var/cache/conftool/dbconfig/20221124-164815-marostegui.json [16:49:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T321126)', diff saved to https://phabricator.wikimedia.org/P41010 and previous config saved to /var/cache/conftool/dbconfig/20221124-164923-marostegui.json [16:49:56] (03CR) 10David Caro: toolforge harbor: update certs with acmechief (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [16:50:29] (03CR) 10Arturo Borrero Gonzalez: wmcs: openstack: common: allow arbitrary flavor names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:53:47] (03PS7) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [16:54:30] (03CR) 10CI reject: [V: 04-1] 
WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [16:55:29] (03CR) 10David Caro: wmcs: openstack: common: allow arbitrary flavor names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:55:44] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860579 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:56:53] (03CR) 10David Caro: wmcs: openstack: common: allow arbitrary flavor names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:57:21] (03PS8) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [16:58:31] (03CR) 10Arturo Borrero Gonzalez: wmcs: openstack: common: allow arbitrary flavor names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:58:34] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: openstack: common: allow arbitrary flavor names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/860621 (owner: 10Arturo Borrero Gonzalez) [16:58:56] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (10hnowlan) [16:59:34] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [16:59:36] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (10hnowlan) [16:59:49] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:09] (03PS1) 10Urbanecm: GrowthExperiments: Remove non-existent variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860624 [17:01:34] * urbanecm is going to push some cleanup live [17:01:36] !log urbanecm@deploy1002 backport aborted: (duration: 00m 01s) [17:01:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860624 (owner: 10Urbanecm) [17:01:53] (03PS1) 10Clément Goubert: C:vopsbot: Notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/860625 [17:03:02] (03Merged) 10jenkins-bot: GrowthExperiments: Remove non-existent variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860624 (owner: 10Urbanecm) [17:03:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:860624|GrowthExperiments: Remove non-existent variables]] [17:03:43] (03PS2) 10Urbanecm: GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 (owner: 10Kosta Harlan) [17:04:25] (03CR) 10Urbanecm: [C: 03+2] "beta-only, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 (owner: 10Kosta Harlan) [17:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P41011 and previous config saved to /var/cache/conftool/dbconfig/20221124-170429-marostegui.json [17:04:58] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38427/console" [puppet] - 10https://gerrit.wikimedia.org/r/860625 (owner: 
10Clément Goubert) [17:05:35] (03PS1) 10David Caro: harbor: remove unused harbor::db module/role [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) [17:05:39] (03Merged) 10jenkins-bot: GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 (owner: 10Kosta Harlan) [17:05:44] (03CR) 10Clément Goubert: C:vopsbot: Notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/860625 (owner: 10Clément Goubert) [17:06:01] (03CR) 10Clément Goubert: [V: 03+1] "PCC OK" [puppet] - 10https://gerrit.wikimedia.org/r/860625 (owner: 10Clément Goubert) [17:07:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:07:56] (03PS1) 10Hnowlan: thumbor: correct tinyrgb path [deployment-charts] - 10https://gerrit.wikimedia.org/r/860628 (https://phabricator.wikimedia.org/T323775) [17:08:16] (03PS2) 10Hnowlan: Add tinyrgb colour profile [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860580 (https://phabricator.wikimedia.org/T323775) [17:08:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:08:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:08:22] (03PS9) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:08:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:860624|GrowthExperiments: Remove non-existent variables]] (duration: 05m 25s) [17:09:05] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [17:09:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 (owner: 10Kosta Harlan) [17:09:16] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:14:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:14:50] (03PS10) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:15:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:15:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:15:58] (03PS11) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:16:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:18:17] (03PS1) 10Hnowlan: Fix TypeError when prepending string to STL files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860632 (https://phabricator.wikimedia.org/T323781) [17:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P41012 and previous config saved to /var/cache/conftool/dbconfig/20221124-171936-marostegui.json [17:22:25] (03PS12) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:26:29] (03PS13) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:30:06] (03PS14) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:32:11] (03PS1) 10Hnowlan: maps: remove Cassandra and Tilerator service [puppet] - 10https://gerrit.wikimedia.org/r/860634 (https://phabricator.wikimedia.org/T298246) [17:32:21] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [17:34:43] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T321126)', diff saved to https://phabricator.wikimedia.org/P41013 and previous config saved to /var/cache/conftool/dbconfig/20221124-173442-marostegui.json [17:34:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:34:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:34:49] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:34:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [17:35:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [17:35:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance [17:35:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance [17:35:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2152.codfw.wmnet with reason: Maintenance [17:35:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2152.codfw.wmnet with reason: Maintenance [17:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T321126)', diff saved to https://phabricator.wikimedia.org/P41014 and previous config saved to /var/cache/conftool/dbconfig/20221124-173556-marostegui.json [17:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321126)', diff saved to https://phabricator.wikimedia.org/P41015 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-173706-marostegui.json [17:37:47] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38435/console" [puppet] - 10https://gerrit.wikimedia.org/r/860634 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [17:42:58] (03CR) 10Vlad.shapik: [C: 03+1] Fix TypeError when prepending string to STL files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860632 (https://phabricator.wikimedia.org/T323781) (owner: 10Hnowlan) [17:52:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P41016 and previous config saved to /var/cache/conftool/dbconfig/20221124-175212-marostegui.json [17:55:19] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/860576 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:59:13] (03PS15) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [18:00:05] bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T1800). 
[18:01:41] (03PS1) 10MSantos: Bump proton to 2022-11-24-154643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/860635 [18:05:50] (03CR) 10Hnowlan: api-gateway: expose restbase /api/ endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [18:07:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P41017 and previous config saved to /var/cache/conftool/dbconfig/20221124-180719-marostegui.json [18:07:41] (03CR) 10Effie Mouzeli: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860634 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [18:08:55] (03PS2) 10Hnowlan: api-gateway: expose restbase /api/ endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) [18:11:48] (03CR) 10MSantos: [C: 03+2] Bump proton to 2022-11-24-154643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/860635 (owner: 10MSantos) [18:12:18] (03CR) 10Hnowlan: [C: 03+2] Fix TypeError when prepending string to STL files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860632 (https://phabricator.wikimedia.org/T323781) (owner: 10Hnowlan) [18:13:20] (03PS30) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [18:15:17] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [18:16:20] (03Merged) 10jenkins-bot: Bump proton to 2022-11-24-154643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/860635 (owner: 10MSantos) [18:17:24] (03Merged) 10jenkins-bot: Fix TypeError when prepending string to STL files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860632 (https://phabricator.wikimedia.org/T323781) (owner: 10Hnowlan) [18:19:04] 
(03CR) 10Vlad.shapik: [C: 03+1] thumbor: correct tinyrgb path [deployment-charts] - 10https://gerrit.wikimedia.org/r/860628 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [18:19:23] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [18:20:22] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [18:21:20] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [18:21:32] (03CR) 10Vlad.shapik: [C: 03+1] Add tinyrgb colour profile [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860580 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [18:22:18] (03Abandoned) 10Ssingh: Release 0.35-2 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/860612 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321126)', diff saved to https://phabricator.wikimedia.org/P41018 and previous config saved to /var/cache/conftool/dbconfig/20221124-182225-marostegui.json [18:22:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2154.codfw.wmnet with reason: Maintenance [18:22:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:22:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2154.codfw.wmnet with reason: Maintenance [18:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T321126)', diff saved to https://phabricator.wikimedia.org/P41019 and previous config saved to /var/cache/conftool/dbconfig/20221124-182247-marostegui.json [18:22:58] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [18:23:29] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [18:24:58] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2154 (T321126)', diff saved to https://phabricator.wikimedia.org/P41020 and previous config saved to /var/cache/conftool/dbconfig/20221124-182457-marostegui.json [18:25:40] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [18:28:45] (03CR) 10Hnowlan: [C: 03+2] Add tinyrgb colour profile [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860580 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [18:34:01] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1256115 MB (15% inode=68%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [18:34:59] (03Merged) 10jenkins-bot: Add tinyrgb colour profile [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/860580 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [18:37:50] (03PS6) 10Vlad.shapik: WP:Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) [18:40:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P41021 and previous config saved to /var/cache/conftool/dbconfig/20221124-184004-marostegui.json [18:45:08] (03PS1) 10Ssingh: setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) [18:50:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:51:09] (03CR) 10CI reject: [V: 04-1] setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) 
[18:52:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) Hi I am using a windows 10 machine and I am having trouble logging in via ssh. When I attempt to connect to the server it prompts for password/pa... [18:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P41022 and previous config saved to /var/cache/conftool/dbconfig/20221124-185510-marostegui.json [18:55:26] (03CR) 10Ssingh: "13:51:05 ImportError: cannot import name 'escape' from 'jinja2' (/src/.tox/py37-tests-min/lib/python3.7/site-packages/jinja2/__init__.py)" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:55:43] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) >>! In T308677#8420280, @MatthewVernon wrote: > ms-be2050 looks good t... 
[19:02:23] (03PS2) 10Ssingh: setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) [19:03:10] (03CR) 10CI reject: [V: 04-1] setup.py: update dependencies for bullseye [software/acme-chief] - 10https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:10:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T321126)', diff saved to https://phabricator.wikimedia.org/P41023 and previous config saved to /var/cache/conftool/dbconfig/20221124-191017-marostegui.json [19:10:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2161.codfw.wmnet with reason: Maintenance [19:10:24] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:10:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2161.codfw.wmnet with reason: Maintenance [19:10:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T321126)', diff saved to https://phabricator.wikimedia.org/P41024 and previous config saved to /var/cache/conftool/dbconfig/20221124-191038-marostegui.json [19:12:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321126)', diff saved to https://phabricator.wikimedia.org/P41025 and previous config saved to /var/cache/conftool/dbconfig/20221124-191249-marostegui.json [19:27:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P41026 and previous config saved to /var/cache/conftool/dbconfig/20221124-192755-marostegui.json [19:43:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P41027 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-194302-marostegui.json [19:56:15] (03PS2) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [19:58:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321126)', diff saved to https://phabricator.wikimedia.org/P41028 and previous config saved to /var/cache/conftool/dbconfig/20221124-195808-marostegui.json [19:58:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:58:16] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:58:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:58:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T321126)', diff saved to https://phabricator.wikimedia.org/P41029 and previous config saved to /var/cache/conftool/dbconfig/20221124-195830-marostegui.json [20:00:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321126)', diff saved to https://phabricator.wikimedia.org/P41030 and previous config saved to /var/cache/conftool/dbconfig/20221124-200040-marostegui.json [20:12:22] (03PS3) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [20:13:41] RECOVERY - cassandra-b CQL 10.64.32.31:9042 on aqs1018 is OK: TCP OK - 0.000 second response time on 10.64.32.31 port 9042 https://phabricator.wikimedia.org/T93886 [20:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff 
saved to https://phabricator.wikimedia.org/P41031 and previous config saved to /var/cache/conftool/dbconfig/20221124-201547-marostegui.json [20:22:40] (03CR) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [20:24:26] (03PS16) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:24:44] (03CR) 10Raymond Ndibe: "Hello David, this needs to be +2'd by someone else. I can't merge it myself" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [20:30:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P41032 and previous config saved to /var/cache/conftool/dbconfig/20221124-203053-marostegui.json [20:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:44] (03PS17) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:49] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) 
[20:46:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321126)', diff saved to https://phabricator.wikimedia.org/P41033 and previous config saved to /var/cache/conftool/dbconfig/20221124-204600-marostegui.json [20:46:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:46:07] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:46:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T321126)', diff saved to https://phabricator.wikimedia.org/P41034 and previous config saved to /var/cache/conftool/dbconfig/20221124-204621-marostegui.json [20:48:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321126)', diff saved to https://phabricator.wikimedia.org/P41035 and previous config saved to /var/cache/conftool/dbconfig/20221124-204832-marostegui.json [20:49:29] (CR) Ssingh: [C: -1] "Need to revert 3.9 or add skip_missing https://phabricator.wikimedia.org/T289222 but that's for another day." [software/acme-chief] - https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [20:56:41] jouncebot: nowandnext [20:56:41] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [20:56:41] In 0 hour(s) and 3 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T2100) [20:57:04] * TheresNoTime isn't available for ^ but looks empty anyway [20:57:14] (PS18) Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - https://gerrit.wikimedia.org/r/860102 [21:00:04] brennen and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221124T2100). [21:02:49] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P41036 and previous config saved to /var/cache/conftool/dbconfig/20221124-210338-marostegui.json [21:18:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P41037 and previous config saved to /var/cache/conftool/dbconfig/20221124-211845-marostegui.json [21:25:01] (PS7) Vlad.shapik: Add ability to specify a DPI value for PDF [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) [21:33:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321126)', diff saved to https://phabricator.wikimedia.org/P41038 and previous config saved to /var/cache/conftool/dbconfig/20221124-213351-marostegui.json [21:33:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:33:59] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:34:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:34:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:34:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
10:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T321126)', diff saved to https://phabricator.wikimedia.org/P41039 and previous config saved to /var/cache/conftool/dbconfig/20221124-213428-marostegui.json [21:36:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321126)', diff saved to https://phabricator.wikimedia.org/P41040 and previous config saved to /var/cache/conftool/dbconfig/20221124-213639-marostegui.json [21:38:32] (CR) Vlad.shapik: "It works as expected, and the image quality is substantially better when we specify a higher DPI." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) (owner: Vlad.shapik) [21:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P41041 and previous config saved to /var/cache/conftool/dbconfig/20221124-215145-marostegui.json [22:02:49] Oops, got confused and pushed to github instead of gerrit ( https://github.com/wikimedia/Timestamp/tree/add-sub ), will delete the branch. 
[22:02:59] (logging here for lack of a better place) [22:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P41042 and previous config saved to /var/cache/conftool/dbconfig/20221124-220652-marostegui.json [22:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321126)', diff saved to https://phabricator.wikimedia.org/P41043 and previous config saved to /var/cache/conftool/dbconfig/20221124-222158-marostegui.json [22:22:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:22:05] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:22:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41044 and previous config saved to /var/cache/conftool/dbconfig/20221124-222220-marostegui.json [22:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41045 and previous config saved to /var/cache/conftool/dbconfig/20221124-222430-marostegui.json [22:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P41046 and previous config saved to /var/cache/conftool/dbconfig/20221124-223937-marostegui.json [22:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P41047 and previous config saved to /var/cache/conftool/dbconfig/20221124-225443-marostegui.json [22:55:29] PROBLEM - puppet last run on wcqs1001 is 
CRITICAL: CRITICAL: Puppet has been disabled for 605047 seconds, message: T321605 - bking, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321126)', diff saved to https://phabricator.wikimedia.org/P41048 and previous config saved to /var/cache/conftool/dbconfig/20221124-230949-marostegui.json [23:09:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:09:57] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:10:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41049 and previous config saved to /var/cache/conftool/dbconfig/20221124-231011-marostegui.json [23:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41050 and previous config saved to /var/cache/conftool/dbconfig/20221124-231221-marostegui.json [23:14:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:14:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:15:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:15:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:17:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:17:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:22:58] (PS2) Andrea Denisse: admin: add dpujol to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: Filippo Giunchedi) [23:23:36] (CR) Andrea Denisse: [C: +2] admin: add dpujol to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: Filippo Giunchedi) [23:23:41] (CR) Andrea Denisse: [V: +2 C: +2] admin: add dpujol to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: Filippo Giunchedi) [23:25:38] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (andrea.denisse) In progress→Resolved Hello, @David.pujol should have access now. Please let me know if there's anything else I could help w... [23:26:49] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (andrea.denisse) Hi @Htriedman and @Jcross , friendly ping to confirm @dasm 's access expiry date. :) [23:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P41051 and previous config saved to /var/cache/conftool/dbconfig/20221124-232728-marostegui.json [23:29:32] (CR) Andrea Denisse: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/860522 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [23:29:53] (CR) Andrea Denisse: [C: +1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/860521 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [23:31:00] (CR) Andrea Denisse: o11y: more lenient logstash kafka consumer lag (1 comment) [alerts] - https://gerrit.wikimedia.org/r/860609 (owner: Filippo Giunchedi) [23:32:17] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (andrea.denisse) [23:33:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:33:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:33:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:33:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [23:36:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P41052 and previous config saved to /var/cache/conftool/dbconfig/20221124-233604-ladsgroup.json [23:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P41053 and previous config saved to /var/cache/conftool/dbconfig/20221124-234234-marostegui.json [23:51:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P41054 and previous config saved to /var/cache/conftool/dbconfig/20221124-235109-ladsgroup.json [23:57:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41055 and previous config saved to 
/var/cache/conftool/dbconfig/20221124-235741-marostegui.json [23:57:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [23:57:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:57:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [23:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T321126)', diff saved to https://phabricator.wikimedia.org/P41056 and previous config saved to /var/cache/conftool/dbconfig/20221124-235803-marostegui.json