[00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3064 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:06] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:06] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:08] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3055 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:08] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:16] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3062 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:16] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3057 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:20] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3051 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:20] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3050 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 
23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:20] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:22] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:24] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:26] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3058 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:26] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:30] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5003 is CRITICAL: SSL 
CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:30] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:50] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:50] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3053 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:53] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:54] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:04] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3059 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:06] 
PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:13] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:13] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:16] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3065 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:16] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3052 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:24] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3061 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:24] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3060 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:24] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 
(expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:36] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:36] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:36] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:38] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3056 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:38] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3054 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:38] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:46] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:48] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:48] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5007 is CRITICAL: SSL CRITICAL - 
Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3063 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:03:36] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:12:00] RECOVERY - Disk space on conf1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1007&var-datasource=eqiad+prometheus/ops [00:16:26] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:16:58] RECOVERY - Disk space on conf1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1009&var-datasource=eqiad+prometheus/ops [00:24:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:26:28] RECOVERY - MediaWiki exceptions and fatals 
per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:38] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:30:38] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:36:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:15] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:10:22] (03PS3) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [01:11:10] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:13:25] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [01:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:16] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:28] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:31:30] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:38:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:06] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:22] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:08:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:58] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:30:56] (03CR) 10RLazarus: slo_dashboards: move to one SLO/SLI per dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [02:44:53] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:46:53] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:04:14] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:55:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:02:48] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:08:48] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [04:54:15] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) 
on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:04:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T0600). [06:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:31:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Marostegui) [06:32:09] (03PS1) 10Marostegui: install_server: Partmap recipe for db1206 [puppet] - 10https://gerrit.wikimedia.org/r/852653 (https://phabricator.wikimedia.org/T322256) [06:32:53] (03PS2) 10Marostegui: install_server: partman recipe for db1206 [puppet] - 10https://gerrit.wikimedia.org/r/852653 (https://phabricator.wikimedia.org/T322256) [06:33:51] (03CR) 10Marostegui: [C: 03+2] install_server: partman recipe for db1206 [puppet] - 10https://gerrit.wikimedia.org/r/852653 (https://phabricator.wikimedia.org/T322256) (owner: 10Marostegui) [06:36:23] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:41] (03PS1) 10Marostegui: mariadb: Add spare db1206 [puppet] - 10https://gerrit.wikimedia.org/r/852654 (https://phabricator.wikimedia.org/T322256) [06:38:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Add spare db1206 [puppet] - 10https://gerrit.wikimedia.org/r/852654 
(https://phabricator.wikimedia.org/T322256) (owner: 10Marostegui) [06:39:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:39:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:40:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:40:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:42:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [06:42:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [06:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37875 and previous config saved to /var/cache/conftool/dbconfig/20221103-064225-marostegui.json [06:42:29] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:43:59] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37876 and previous config saved to /var/cache/conftool/dbconfig/20221103-064438-marostegui.json [06:45:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Marostegui) a:05Marostegui→03Jclark-ctr >>! 
In T322256#8364482, @RobH wrote: > @Marostegui, > > Can you populate the racking info (partitioning, network details, any r... [06:46:46] (03PS1) 10Marostegui: site.pp: Change spare with data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/852655 [06:47:07] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Marostegui) [06:47:27] (03CR) 10Marostegui: [C: 03+2] site.pp: Change spare with data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/852655 (owner: 10Marostegui) [06:50:30] (03PS1) 10Marostegui: install_server: Remove db2183, db2184 [puppet] - 10https://gerrit.wikimedia.org/r/852656 [06:50:58] (03CR) 10Marostegui: "jcrespo not sure if this can be merged, if it can, can you self serve?" [puppet] - 10https://gerrit.wikimedia.org/r/852656 (owner: 10Marostegui) [06:59:14] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P37877 and previous config saved to /var/cache/conftool/dbconfig/20221103-065946-marostegui.json [07:00:05] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T0700) [07:00:16] morning! we have no trainees signed up for the window and no patches scheduled either. which is nice because it's 9am for me with Daylight Savings Time having hit the EU earlier than the Americas, and that's just a wee earlier than I like to be deploying. Not awake yet...
[07:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:08:58] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment of IDM. [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P37878 and previous config saved to /var/cache/conftool/dbconfig/20221103-071455-marostegui.json [07:14:59] !log Create idm and idm_staging databases on m5 T320426 [07:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:02] T320426: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 [07:20:13] (03PS1) 10Slyngshede: C:idm::uwsgi_processes do not create directory twice. [puppet] - 10https://gerrit.wikimedia.org/r/852658 (https://phabricator.wikimedia.org/T320428) [07:21:01] (03CR) 10Slyngshede: [C: 03+2] C:idm::uwsgi_processes do not create directory twice. [puppet] - 10https://gerrit.wikimedia.org/r/852658 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:26:11] (03PS1) 10Marostegui: production-m5.sql.erb: Add idm and idm_staging users [puppet] - 10https://gerrit.wikimedia.org/r/852728 (https://phabricator.wikimedia.org/T320426) [07:28:12] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) I have created both databases on m5 (m5-master.eqiad.wmnet) The users are also there and available: ` root@cumin1001:~# mysql --ssl-veri...
[07:28:34] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Add idm and idm_staging users [puppet] - 10https://gerrit.wikimedia.org/r/852728 (https://phabricator.wikimedia.org/T320426) (owner: 10Marostegui) [07:28:54] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) 05Open→03Resolved I am closing this for now, reopen if you find issues. [07:28:58] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10Marostegui) [07:30:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37879 and previous config saved to /var/cache/conftool/dbconfig/20221103-073004-marostegui.json [07:30:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:30:08] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [07:30:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37880 and previous config saved to /var/cache/conftool/dbconfig/20221103-073028-marostegui.json [07:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37881 and previous config saved to /var/cache/conftool/dbconfig/20221103-073240-marostegui.json [07:37:56] (03PS1) 10Slyngshede: C:idm:deployment Git checkout should be in a subdir. 
[puppet] - 10https://gerrit.wikimedia.org/r/852733 (https://phabricator.wikimedia.org/T320428) [07:38:30] (03CR) 10CI reject: [V: 04-1] C:idm:deployment Git checkout should be in a subdir. [puppet] - 10https://gerrit.wikimedia.org/r/852733 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:39:26] (03PS2) 10Slyngshede: C:idm:deployment Git checkout should be in a subdir. [puppet] - 10https://gerrit.wikimedia.org/r/852733 (https://phabricator.wikimedia.org/T320428) [07:40:38] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment Git checkout should be in a subdir. [puppet] - 10https://gerrit.wikimedia.org/r/852733 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:47:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P37882 and previous config saved to /var/cache/conftool/dbconfig/20221103-074748-marostegui.json [07:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:55:11] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1025.eqiad.wmnet with reason: Remove from cluster for eventual reimage [07:55:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:55:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1025.eqiad.wmnet with reason: Remove from cluster for eventual reimage [07:58:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [07:58:25] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - 
https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [08:00:05] jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T0800). [08:01:48] !log installing exim4 security updates [08:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P37883 and previous config saved to /var/cache/conftool/dbconfig/20221103-080257-marostegui.json [08:04:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nicely done re: addressing alert fatigue" [alerts] - 10https://gerrit.wikimedia.org/r/852206 (https://phabricator.wikimedia.org/T322220) (owner: 10Ssingh) [08:13:54] (03CR) 10Jcrespo: "Apparently, these hosts have been mislabeled as owned by data engineering: 4697cc9bdac" [puppet] - 10https://gerrit.wikimedia.org/r/852656 (owner: 10Marostegui) [08:17:18] !log installing glibc security updates on buster [08:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37884 and previous config saved to /var/cache/conftool/dbconfig/20221103-081805-marostegui.json [08:18:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:18:08] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:18:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: 
Maintenance [08:18:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T321123)', diff saved to https://phabricator.wikimedia.org/P37885 and previous config saved to /var/cache/conftool/dbconfig/20221103-081827-marostegui.json [08:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321123)', diff saved to https://phabricator.wikimedia.org/P37886 and previous config saved to /var/cache/conftool/dbconfig/20221103-082040-marostegui.json [08:22:34] (03PS2) 10Jcrespo: install_server: remove db2183, db2184 from db recipe, correct owner [puppet] - 10https://gerrit.wikimedia.org/r/852656 (owner: 10Marostegui) [08:25:03] (03CR) 10Jcrespo: [C: 03+2] install_server: remove db2183, db2184 from db recipe, correct owner [puppet] - 10https://gerrit.wikimedia.org/r/852656 (owner: 10Marostegui) [08:35:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P37887 and previous config saved to /var/cache/conftool/dbconfig/20221103-083549-marostegui.json [08:37:22] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:37:27] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. 
[08:37:28] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:39:51] !log installing ruby-nokogiri security updates [08:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:34] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1025.eqiad.wmnet with OS bullseye [08:43:38] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors: - ganeti1025 (**FAIL**) - D... [08:44:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [08:44:31] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [08:51:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P37888 and previous config saved to /var/cache/conftool/dbconfig/20221103-085059-marostegui.json [08:53:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1025.eqiad.wmnet with OS bullseye [08:53:33] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors: - ganeti1025 (**FAIL**) - R...
[08:53:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [08:53:49] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [08:56:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [08:59:35] (PS2) Phuedx: Add config for Visual Editor Feature Use instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [09:01:24] (CR) Phuedx: Add config for Visual Editor Feature Use instrument (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [09:02:01] (PS3) Phuedx: testwiki: Add config for Visual Editor Feature Use instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [09:02:03] (PS4) Phuedx: testwiki: Add config for Visual Editor Feature Use instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [09:02:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1025.eqiad.wmnet with OS bullseye [09:02:12] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors: - ganeti1025 (**FAIL**) - R...
[09:02:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [09:02:31] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [09:05:26] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [09:05:26] (CR) David Caro: [C: +2] puppet_enc: use the repo-wide line length and fix profile [puppet] - https://gerrit.wikimedia.org/r/850005 (owner: David Caro) [09:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321123)', diff saved to https://phabricator.wikimedia.org/P37889 and previous config saved to /var/cache/conftool/dbconfig/20221103-090607-marostegui.json [09:06:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:06:10] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:06:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T321123)', diff saved to https://phabricator.wikimedia.org/P37890 and previous config saved to /var/cache/conftool/dbconfig/20221103-090631-marostegui.json [09:08:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321123)', diff saved to https://phabricator.wikimedia.org/P37891 and previous config saved to /var/cache/conftool/dbconfig/20221103-090844-marostegui.json [09:10:02] (CR) Filippo Giunchedi: [V: +1 C: +2] dispatch: refactor/simplify db profile [puppet] - https://gerrit.wikimedia.org/r/851693
(https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [09:12:38] (CR) Elukey: [C: +1] Rename ml_k8s staging roles to match naming scheme [labs/private] - https://gerrit.wikimedia.org/r/852196 (owner: JMeybohm) [09:13:58] (CR) Elukey: [C: +1] "The change looks good to me! Tobias should be able to roll it out during the next few days." [labs/private] - https://gerrit.wikimedia.org/r/852196 (owner: JMeybohm) [09:14:18] (CR) Phuedx: [C: +1] testwiki: Add config for Visual Editor Feature Use instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [09:23:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P37892 and previous config saved to /var/cache/conftool/dbconfig/20221103-092353-marostegui.json [09:24:14] (PS1) Volans: dns: skip mgmt records for decommissioning devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/852738 (https://phabricator.wikimedia.org/T320721) [09:26:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [09:32:34] (PS2) Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - https://gerrit.wikimedia.org/r/845515 [09:32:46] (CR) Elukey: [C: +1] "Checked the new file in /etc and the diff with DAEMON_ARGS, looks good to me!"
[puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [09:36:15] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1025.eqiad.wmnet'] [09:36:23] (CR) Volans: "question inline while waiting for jenkins ;)" [cookbooks] - https://gerrit.wikimedia.org/r/845515 (owner: Jbond) [09:38:11] (CR) Elukey: Add a namespace for the stream-enrichment-poc on dse-k8s (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: Btullis) [09:38:26] (PS1) Volans: sre.hosts.decommission: unset mgmt DNS name [cookbooks] - https://gerrit.wikimedia.org/r/852739 (https://phabricator.wikimedia.org/T320721) [09:39:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P37893 and previous config saved to /var/cache/conftool/dbconfig/20221103-093901-marostegui.json [09:39:10] (CR) Elukey: [C: +1] Rename ml_k8s staging roles to match naming scheme (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/852196 (owner: JMeybohm) [09:39:34] (CR) Elukey: [C: +1] "The change looks good to me! Tobias should be able to roll it out during the next few days."
[puppet] - https://gerrit.wikimedia.org/r/852158 (owner: JMeybohm) [09:42:36] (CR) Volans: [C: +1] "LGTM, but I'm unsure about one bit, see inline" [cookbooks] - https://gerrit.wikimedia.org/r/845515 (owner: Jbond) [09:46:13] (PS1) Vgutierrez: cluster: Add deployment-prep swift cluster definition [puppet] - https://gerrit.wikimedia.org/r/852740 (https://phabricator.wikimedia.org/T322231) [09:48:25] (CR) Majavah: [C: -1] cluster: Add deployment-prep swift cluster definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852740 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [09:49:44] (PS1) Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.4 [puppet] - https://gerrit.wikimedia.org/r/852741 (https://phabricator.wikimedia.org/T322289) [09:49:56] (Abandoned) Vgutierrez: cluster: Add deployment-prep swift cluster definition [puppet] - https://gerrit.wikimedia.org/r/852740 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [09:50:48] (CR) JMeybohm: [C: +2] Pin cert-manager and cfssl-issuer chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/838134 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm) [09:51:02] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Hibashaath - https://phabricator.wikimedia.org/T321902 (KCVelaga_WMF) [09:51:42] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Hghani - https://phabricator.wikimedia.org/T321910 (KCVelaga_WMF) [09:51:45] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Ilooremeta - https://phabricator.wikimedia.org/T321918 (KCVelaga_WMF) [09:51:57] (PS1) Btullis: Update data.yaml to reflect that sstefanova is kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/852743 (https://phabricator.wikimedia.org/T320253) [09:52:11] (CR) JMeybohm: cfssl-issuer: Move from single to multiple files for CRDs (1 comment) [deployment-charts] -
https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm) [09:54:05] (CR) Btullis: [C: +2] Update data.yaml to reflect that sstefanova is kerberos enabled [puppet] - https://gerrit.wikimedia.org/r/852743 (https://phabricator.wikimedia.org/T320253) (owner: Btullis) [09:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321123)', diff saved to https://phabricator.wikimedia.org/P37894 and previous config saved to /var/cache/conftool/dbconfig/20221103-095409-marostegui.json [09:54:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:54:13] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:54:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:54:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:54:38] (Merged) jenkins-bot: Pin cert-manager and cfssl-issuer chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/838134 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm) [09:54:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:55:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37895 and previous config saved to /var/cache/conftool/dbconfig/20221103-095501-marostegui.json [09:55:59] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (KCVelaga_WMF) a:HShaath-WMF→None
[09:56:02] (CR) JMeybohm: cfssl-issuer: Bump CRD chart version for cfssl-issuer update (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/838136 (owner: JMeybohm) [09:56:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (KCVelaga_WMF) [09:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37896 and previous config saved to /var/cache/conftool/dbconfig/20221103-095715-marostegui.json [09:59:20] (PS9) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [09:59:22] (PS28) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [09:59:24] (PS19) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [09:59:26] (PS4) Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - https://gerrit.wikimedia.org/r/836226 [09:59:36] (PS1) Slyngshede: C:idm:deployment add django configuration. [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) [10:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1000) [10:00:12] (CR) CI reject: [V: -1] C:idm:deployment add django configuration. [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [10:01:59] (PS2) Slyngshede: C:idm:deployment add django configuration. [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) [10:02:33] (CR) CI reject: [V: -1] C:idm:deployment add django configuration.
[puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [10:02:37] (PS3) Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) [10:03:53] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [10:04:02] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (fnegri) @nskaggs yes, we'll get dbproxy1019 for free, and after that I can depool dbproxy1018 and reboot that one as well. [10:04:15] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond) [10:04:25] (PS3) Slyngshede: C:idm:deployment add django configuration. [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) [10:04:27] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - https://gerrit.wikimedia.org/r/836226 (owner: Jbond) [10:07:06] (PS4) JMeybohm: Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) [10:07:32] (PS5) JMeybohm: Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) [10:08:31] (PS1) Vgutierrez: puppetmaster::standalone: Sync swift rings [puppet] - https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) [10:09:20] (PS4) Slyngshede: C:idm:deployment add django configuration.
[puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) [10:09:47] (CR) Arturo Borrero Gonzalez: [C: +1] "my first impulse would have been to put this in the package though." [puppet] - https://gerrit.wikimedia.org/r/851685 (owner: David Caro) [10:11:04] (CR) JMeybohm: [C: +2] Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [10:12:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P37897 and previous config saved to /var/cache/conftool/dbconfig/20221103-101222-marostegui.json [10:12:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti1025.eqiad.wmnet'] [10:18:27] (PS10) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [10:18:29] (PS29) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [10:19:23] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1025.eqiad.wmnet'] [10:20:17] (PS20) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [10:24:27] (PS21) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [10:25:36] (PS30) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [10:26:48] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/852741 (https://phabricator.wikimedia.org/T322289) (owner: Jelto) [10:27:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P37898 and previous config saved to /var/cache/conftool/dbconfig/20221103-102730-marostegui.json [10:29:23] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond) [10:29:41] (CR) Jelto: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 15.4 [puppet] - https://gerrit.wikimedia.org/r/852741 (https://phabricator.wikimedia.org/T322289) (owner: Jelto) [10:29:46] (PS22) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [10:30:58] (PS5) Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - https://gerrit.wikimedia.org/r/836226 [10:33:15] (PS6) Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - https://gerrit.wikimedia.org/r/836226 [10:33:41] (CR) Jbond: "After some rebasing hell this and the preceding 3 changes are ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/836226 (owner: Jbond) [10:34:31] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37929/console" [puppet] - https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [10:37:15] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/852256 (owner: Dzahn) [10:38:03] (CR) Muehlenhoff: [C: +1] "Looks good!
See inline comment about rotation, but let's do that as a followup patch" [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [10:38:27] (CR) Jbond: [C: +1] "lgtm but see inline question" [puppet] - https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [10:39:08] (PS2) Jbond: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [10:39:14] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:39:16] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [10:39:27] (CR) Jbond: [C: +1] "lgtm, also kicked of a pcc run" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [10:40:59] (CR) Jbond: "i dont remember this either and considering its from feb i think it can be abandoned" [puppet] - https://gerrit.wikimedia.org/r/765629 (owner: Jbond) [10:41:03] (Abandoned) Jbond: Revert "C:package_builder: install tools to build node packages" [puppet] - https://gerrit.wikimedia.org/r/765629 (owner: Jbond) [10:42:13] (CR) Vgutierrez: [V: +1] "PCC looks happy: https://puppet-compiler.wmflabs.org/pcc-worker1002/37929/" [puppet] - https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [10:42:31] (CR) Klausman: [C: +1] Rename ml_k8s staging roles to match naming scheme [labs/private] - https://gerrit.wikimedia.org/r/852196 (owner: JMeybohm) [10:42:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37899 and previous config saved to /var/cache/conftool/dbconfig/20221103-104239-marostegui.json [10:42:41] !log marostegui@cumin1001 START -
Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:42:42] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:43:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:43:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37900 and previous config saved to /var/cache/conftool/dbconfig/20221103-104313-marostegui.json [10:43:41] (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/852738 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [10:44:00] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:34] (CR) Vgutierrez: puppetmaster::standalone: Sync swift rings [puppet] - https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [10:48:12] (CR) Volans: [C: +1] "LGTM, minor nit inline" [cookbooks] - https://gerrit.wikimedia.org/r/835579 (owner: Jbond) [10:48:43] (PS3) Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - https://gerrit.wikimedia.org/r/845515 [10:48:55] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/845515 (owner: Jbond) [10:49:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [10:49:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:49:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason:
Maintenance [10:49:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:49:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:49:40] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/852739 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [10:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P37901 and previous config saved to /var/cache/conftool/dbconfig/20221103-104942-ladsgroup.json [10:49:45] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:49:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:49:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:49:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [10:49:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37902 and previous config saved to /var/cache/conftool/dbconfig/20221103-104957-ladsgroup.json [10:50:00] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:50:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:50:17] (CR) Majavah: [C: -1] puppetmaster::standalone: Sync swift rings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [10:50:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [10:50:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [10:51:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:51:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:51:46] (CR) Volans: [C: +1] "LGTM, just a typo in the help" [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [10:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T318955)', diff saved to https://phabricator.wikimedia.org/P37903 and previous config saved to /var/cache/conftool/dbconfig/20221103-105148-ladsgroup.json [10:51:52] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:52:02] (PS2) Btullis: Add a namespace for the stream-enrichment-poc on dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) [10:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37904 and previous config saved to /var/cache/conftool/dbconfig/20221103-105243-ladsgroup.json [10:53:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1025.eqiad.wmnet'] [10:53:32] (PS11) Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [10:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318955)', diff saved to https://phabricator.wikimedia.org/P37905 and previous config saved to /var/cache/conftool/dbconfig/20221103-105429-ladsgroup.json [10:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37906 and previous config saved to /var/cache/conftool/dbconfig/20221103-105527-marostegui.json [10:55:31] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:56:30] (PS1) Jbond: P:spicerack: add python3-packaging [puppet] - https://gerrit.wikimedia.org/r/852771 [10:56:34] (PS12) Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [10:56:53] (CR) Jbond: "thanks" [cookbooks] - https://gerrit.wikimedia.org/r/835579
(owner: Jbond) [10:57:21] (PS31) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [10:57:26] (PS32) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [10:57:34] (PS23) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [10:57:38] (PS7) Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - https://gerrit.wikimedia.org/r/836226 [10:57:40] (CR) Slyngshede: [C: +2] C:idm:deployment add django configuration. [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [10:57:58] (CR) Slyngshede: [C: +2] C:idm:deployment add django configuration. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/852744 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [10:58:23] (CR) Volans: "LGTM, some nit inline" [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond) [10:59:27] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/836226 (owner: Jbond) [10:59:54] (PS1) Muehlenhoff: sre.hardware.upgrade-firmware: Fix usage [cookbooks] - https://gerrit.wikimedia.org/r/852773 [11:01:01] (CR) Jbond: [C: +1] "lgtm thanks" [cookbooks] - https://gerrit.wikimedia.org/r/852773 (owner: Muehlenhoff) [11:02:25] (CR) Volans: [C: +1] "LGTM, make sure to test it as this is now used by a lot of cookbooks :)" [cookbooks] - https://gerrit.wikimedia.org/r/845515 (owner: Jbond) [11:03:29] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/852771 (owner: Jbond) [11:03:41] (PS1) Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298)
[11:03:55] (CR) Volans: [C: +2] dns: skip mgmt records for decommissioning devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/852738 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [11:04:08] (PS1) Elukey: Add golang 1.19 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/852776 (https://phabricator.wikimedia.org/T322193) [11:04:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1025.eqiad.wmnet with OS bullseye [11:04:19] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors: - ganeti1025 (**FAIL**) - R... [11:04:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [11:04:34] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [11:05:16] (Merged) jenkins-bot: dns: skip mgmt records for decommissioning devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/852738 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [11:05:48] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37930/console" [puppet] - https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: Clément Goubert) [11:06:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:06:44] !log volans@cumin1001 START - Cookbook sre.dns.netbox [11:07:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff
saved to https://phabricator.wikimedia.org/P37907 and previous config saved to /var/cache/conftool/dbconfig/20221103-110751-ladsgroup.json [11:08:44] (03PS24) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [11:08:46] (03CR) 10Jbond: "update thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [11:08:56] (03PS8) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [11:09:22] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:09:27] (03PS1) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [11:09:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P37908 and previous config saved to /var/cache/conftool/dbconfig/20221103-110939-ladsgroup.json [11:10:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P37909 and previous config saved to /var/cache/conftool/dbconfig/20221103-111037-marostegui.json [11:11:02] (03PS2) 10Vgutierrez: puppetmaster::standalone: Sync swift rings [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) [11:11:04] (03CR) 10Jbond: [C: 03+2] P:spicerack: add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/852771 (owner: 10Jbond) [11:11:34] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:12:31] (03CR) 10Elukey: "Left some comments about the versioning, lemme know your thoughts!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [11:13:27] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37932/console" [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [11:13:30] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:13:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [11:13:54] (03CR) 10Vgutierrez: [V: 03+1] puppetmaster::standalone: Sync swift rings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [11:14:14] (03PS2) 10Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) [11:14:16] (03PS2) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [11:16:13] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [11:17:01] (03CR) 10David Caro: puppet_compiler.differ: add 
support to filter by core type (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [11:17:20] (03PS2) 10Jbond: R:swift::label_filesystem: jst check that any lable is on the disk [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) [11:17:33] (03CR) 10Jbond: R:swift::label_filesystem: jst check that any lable is on the disk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:17:35] (03CR) 10David Caro: [C: 03+1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [11:18:26] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [11:19:12] (03CR) 10David Caro: Add upgrade_openstack_node.py (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [11:23:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P37910 and previous config saved to /var/cache/conftool/dbconfig/20221103-112300-ladsgroup.json [11:23:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [11:23:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [11:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37911 and previous config saved to /var/cache/conftool/dbconfig/20221103-112343-ladsgroup.json [11:23:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:23:59] (03CR) 10Btullis: Add a namespace for the stream-enrichment-poc on dse-k8s (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [11:24:26] (03PS3) 10Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) [11:24:29] (03PS3) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [11:24:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [11:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P37912 and previous config saved to /var/cache/conftool/dbconfig/20221103-112448-ladsgroup.json [11:25:02] (03CR) 10CI reject: [V: 04-1] P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [11:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P37913 and previous config saved to /var/cache/conftool/dbconfig/20221103-112546-marostegui.json [11:27:38] (03PS4) 10Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) [11:27:40] (03PS4) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [11:28:01] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1025.mgmt.eqiad.wmnet with reboot policy GRACEFUL [11:28:59] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:00] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P37914 and previous config saved to /var/cache/conftool/dbconfig/20221103-112900-ladsgroup.json [11:29:03] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:31:45] (03PS3) 10Vgutierrez: puppetmaster::standalone: Sync swift rings [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) [11:33:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37936/console" [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [11:34:03] (03PS5) 10Clément Goubert: P:kubernetes::deployment_server: absent services [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) [11:34:05] (03PS5) 10Clément Goubert: mwdebug: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) [11:34:46] (03PS1) 10Klausman: wikilabels: Cleanup old DB proxy information [puppet] - 10https://gerrit.wikimedia.org/r/852779 (https://phabricator.wikimedia.org/T307389) [11:35:22] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37937/console" [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:35:38] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37938/console" [puppet] - 10https://gerrit.wikimedia.org/r/852779 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [11:35:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1025.eqiad.wmnet with OS 
bullseye [11:35:45] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors: - ganeti1025 (**FAIL**) - R... [11:36:17] (03CR) 10MVernon: [C: 03+1] "LGTM, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:37:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [11:37:39] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1025.mgmt.eqiad.wmnet with reboot policy GRACEFUL [11:38:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37916 and previous config saved to /var/cache/conftool/dbconfig/20221103-113809-ladsgroup.json [11:38:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:38:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:38:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:38:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37917 and previous config saved to /var/cache/conftool/dbconfig/20221103-113833-ladsgroup.json [11:39:20] (03CR) 10Klausman: [C: 03+1] "Not quite sure if the backports repo config should be limited to more than "*", but 
it likely doesn't make a difference here." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852776 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [11:39:22] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:39:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bullseye [11:39:27] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye [11:39:39] (03CR) 10Jbond: [C: 03+2] R:swift::label_filesystem: jst check that any lable is on the disk [puppet] - 10https://gerrit.wikimedia.org/r/849595 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318955)', diff saved to https://phabricator.wikimedia.org/P37918 and previous config saved to /var/cache/conftool/dbconfig/20221103-113956-ladsgroup.json [11:39:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:40:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T318955)', diff saved to https://phabricator.wikimedia.org/P37919 and previous config saved to 
/var/cache/conftool/dbconfig/20221103-114021-ladsgroup.json [11:40:24] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37939/console" [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [11:40:33] (03CR) 10Klausman: [C: 03+1] "LGTM from me as well, with the same caveat as Luca's" [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [11:40:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321123)', diff saved to https://phabricator.wikimedia.org/P37920 and previous config saved to /var/cache/conftool/dbconfig/20221103-114054-marostegui.json [11:40:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:40:57] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:41:06] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] puppetmaster::standalone: Sync swift rings [puppet] - 10https://gerrit.wikimedia.org/r/852767 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [11:41:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:41:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:41:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37921 and previous config saved to /var/cache/conftool/dbconfig/20221103-114116-ladsgroup.json [11:41:26] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:41:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:41:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321123)', diff saved to https://phabricator.wikimedia.org/P37922 and previous config saved to /var/cache/conftool/dbconfig/20221103-114135-marostegui.json [11:42:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [11:42:36] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: ensure we create all directories [cookbooks] - 10https://gerrit.wikimedia.org/r/852782 [11:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318955)', diff saved to https://phabricator.wikimedia.org/P37923 and previous config saved to /var/cache/conftool/dbconfig/20221103-114304-ladsgroup.json [11:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P37924 and previous config saved to /var/cache/conftool/dbconfig/20221103-114408-ladsgroup.json [11:46:16] (03PS1) 10Slyngshede: P:idm Add missing project variable. [puppet] - 10https://gerrit.wikimedia.org/r/852784 (https://phabricator.wikimedia.org/T320428) [11:46:45] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [11:47:57] (03CR) 10Clément Goubert: [V: 03+1] "This should allow cleaning up services. 
An example in https://gerrit.wikimedia.org/r/c/operations/puppet/+/852777" [puppet] - 10https://gerrit.wikimedia.org/r/852775 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [11:48:45] (03CR) 10Clément Goubert: [V: 03+1] "Cleaning up mwdebug service files on deployment server" [puppet] - 10https://gerrit.wikimedia.org/r/852777 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:50:27] (03CR) 10Slyngshede: [C: 03+2] P:idm Add missing project variable. [puppet] - 10https://gerrit.wikimedia.org/r/852784 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [11:51:33] (03PS2) 10Muehlenhoff: kubernetes: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842765 (https://phabricator.wikimedia.org/T308013) [11:51:58] (03CR) 10Btullis: Update the spark and spark-operator images (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [11:52:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1025.eqiad.wmnet with reason: host reimage [11:55:24] (03CR) 10JMeybohm: [C: 03+1] Add golang 1.19 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852776 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [11:55:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:55:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1025.eqiad.wmnet with reason: host reimage [11:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P37928 and previous config saved to /var/cache/conftool/dbconfig/20221103-115624-ladsgroup.json [11:58:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P37929 and previous config saved 
to /var/cache/conftool/dbconfig/20221103-115813-ladsgroup.json [11:58:21] (03PS1) 10Volans: dns: silence log for decommissioned devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) [11:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P37930 and previous config saved to /var/cache/conftool/dbconfig/20221103-115916-ladsgroup.json [11:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37931 and previous config saved to /var/cache/conftool/dbconfig/20221103-115928-ladsgroup.json [11:59:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:05:14] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: unset mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/852739 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [12:05:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:06:45] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:07:22] (03PS1) 10Clément Goubert: mw-debug: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852809 (https://phabricator.wikimedia.org/T321201) [12:07:47] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:08:07] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 (owner: 10Jbond) [12:08:31] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:08:33] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:09:40] (03CR) 10Muehlenhoff: [C: 03+2] kubernetes: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842765 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) 
[12:10:20] (03CR) 10Volans: [C: 03+1] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [12:10:29] (03Merged) 10jenkins-bot: sre.hosts.decommission: unset mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/852739 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [12:10:48] (03PS2) 10Clément Goubert: mw-debug: Add dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852809 (https://phabricator.wikimedia.org/T321201) [12:11:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1025.eqiad.wmnet with OS bullseye [12:11:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye completed: - ganeti1025 (**PASS**) - Removed from... [12:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P37932 and previous config saved to /var/cache/conftool/dbconfig/20221103-121133-ladsgroup.json [12:12:19] (03PS1) 10Clément Goubert: mwdebug: Remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/852811 (https://phabricator.wikimedia.org/T321201) [12:13:09] (03PS2) 10Clément Goubert: mwdebug: Remove old mwdebug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) [12:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P37934 and previous config saved to /var/cache/conftool/dbconfig/20221103-121320-ladsgroup.json [12:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P37935 and previous config saved to 
/var/cache/conftool/dbconfig/20221103-121423-ladsgroup.json [12:14:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:14:27] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P37936 and previous config saved to /var/cache/conftool/dbconfig/20221103-121436-ladsgroup.json [12:14:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T318605)', diff saved to https://phabricator.wikimedia.org/P37937 and previous config saved to /var/cache/conftool/dbconfig/20221103-121458-ladsgroup.json [12:15:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:15:12] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321123)', diff saved to https://phabricator.wikimedia.org/P37938 and previous config saved to /var/cache/conftool/dbconfig/20221103-121553-marostegui.json [12:15:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:16:08] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:16:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:16:59] 10SRE, 10Infrastructure-Foundations, 10Traffic: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10BBlack) 05Open→03Resolved a:03BBlack Should've been resolved a while back! 
[12:17:25] (Abandoned) BBlack: Add drmrs site instances [puppet] - https://gerrit.wikimedia.org/r/692869 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [12:17:33] (Abandoned) BBlack: conftool-data/node: Add drmrs nodes [puppet] - https://gerrit.wikimedia.org/r/692331 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [12:17:45] (Abandoned) BBlack: hieradata: Add drmrs domain to puppet master allow list [puppet] - https://gerrit.wikimedia.org/r/692332 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [12:17:54] (Abandoned) BBlack: hieradata/cloud: Add drmrs to ntp peers list [puppet] - https://gerrit.wikimedia.org/r/692333 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [12:17:57] (PS1) Muehlenhoff: sre.misc-clusters.roll-restart-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/852814 [12:19:18] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/852815 (owner: L10n-bot) [12:20:32] SRE, Traffic, Patch-For-Review: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (BBlack) Bump - we should revisit this, but perhaps after finishing the cache role name cleanup (text vs text_envoy vs text_haproxy...).
[12:20:46] (CR) BBlack: [C: +2] Switch drmrs, eqsin, esams to digicert-2022 [puppet] - https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [12:22:20] (CR) Volans: [C: +1] "LGTM, formatting nit inline no need for re-review" [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond) [12:22:37] (CR) JMeybohm: [C: +1] sre.misc-clusters.roll-restart-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/852814 (owner: Muehlenhoff) [12:23:08] (CR) Muehlenhoff: [C: +2] sre.misc-clusters.roll-restart-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/852814 (owner: Muehlenhoff) [12:23:20] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3061 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494323 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:23:23] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/852782 (owner: Jbond) [12:23:29] recoveries incoming :) [12:23:39] (Abandoned) Muehlenhoff: sre.misc-clusters.roll-restart-reboot-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/832591 (owner: Muehlenhoff) [12:24:02] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3061 is OK: SSL OK - OCSP staple validity for wikipedia.org has 509040 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:24:34] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 509008 seconds left:Certificate *.wikipedia.org contains all required
SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:24:37] yeah there will be about... 96 of those over the next agent run window :) [12:25:14] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494209 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:25:18] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494205 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:25:33] * claime braces for spam [12:25:40] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508943 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:25:50] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494173 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:25:52] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508930 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:25:54] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA 
on cp3052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508928 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:01] silencing doesn't affect recovery IIRC [12:26:08] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3051 is OK: SSL OK - OCSP staple validity for wikipedia.org has 580554 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:08] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3056 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494154 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:10] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3058 is OK: SSL OK - OCSP staple validity for wikipedia.org has 580552 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:20] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3051 is OK: SSL OK - OCSP staple validity for wikipedia.org has 595304 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:20] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3056 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508903 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [12:26:20] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3058 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508903 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:22] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494140 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:32] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494130 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:26:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P37939 and previous config saved to /var/cache/conftool/dbconfig/20221103-122640-ladsgroup.json [12:26:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:26:44] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:26:47] bblack: I think you're right, I remember running into the same issue at $JOB~1 [12:26:50] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508873 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS 
[12:26:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:26:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:08] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508854 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:08] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494094 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318955)', diff saved to https://phabricator.wikimedia.org/P37940 and previous config saved to /var/cache/conftool/dbconfig/20221103-122709-ladsgroup.json [12:27:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [12:27:12] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508850 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:24] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3054 is OK: SSL OK - OCSP staple validity for wikipedia.org has 580479 seconds 
left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:36] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494067 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:36] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508827 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:27:54] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 494049 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:28:14] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3054 is OK: SSL OK - OCSP staple validity for wikipedia.org has 595190 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318955)', diff saved to https://phabricator.wikimedia.org/P37941 and previous config saved to /var/cache/conftool/dbconfig/20221103-122831-ladsgroup.json [12:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:28:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T318955)', diff saved to https://phabricator.wikimedia.org/P37942 and previous config saved to /var/cache/conftool/dbconfig/20221103-122854-ladsgroup.json [12:29:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493943 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:29:43] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508699 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:29:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P37943 and previous config saved to /var/cache/conftool/dbconfig/20221103-122944-ladsgroup.json [12:29:53] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 580329 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:29:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508689 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:30:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3055 is OK: SSL OK - OCSP staple validity for 
wikipedia.org has 508656 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:30:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3064 is OK: SSL OK - OCSP staple validity for wikipedia.org has 580279 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318955)', diff saved to https://phabricator.wikimedia.org/P37944 and previous config saved to /var/cache/conftool/dbconfig/20221103-123048-ladsgroup.json [12:31:00] (PS1) Slyngshede: C:idm::deployment wrongly named template var. [puppet] - https://gerrit.wikimedia.org/r/852826 (https://phabricator.wikimedia.org/T320428) [12:31:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P37945 and previous config saved to /var/cache/conftool/dbconfig/20221103-123101-marostegui.json [12:31:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318955)', diff saved to https://phabricator.wikimedia.org/P37946 and previous config saved to /var/cache/conftool/dbconfig/20221103-123137-ladsgroup.json [12:31:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493819 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:31:54] (CR) Slyngshede: [C: +2] C:idm::deployment wrongly named template var. 
[puppet] - https://gerrit.wikimedia.org/r/852826 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:33:43] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6001 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508460 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:34:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493643 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:34:51] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493631 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:34:55] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508388 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:34:59] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508383 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:35:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [12:36:23] (PS10) Jbond: puppet_compiler.differ: add support to filter by 
core type [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 [12:36:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3063 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508296 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:36:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3064 is OK: SSL OK - OCSP staple validity for wikipedia.org has 594697 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:36:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508295 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:36:29] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579933 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:36:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508265 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:37:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508210 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org 
(ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:38:25] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493417 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:38:35] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3060 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579807 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:38:41] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3063 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493401 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:38:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579799 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:38:43] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508159 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:39:17] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493365 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:39:17] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508125 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:40:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579702 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:40:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T318605)', diff saved to https://phabricator.wikimedia.org/P37947 and previous config saved to /var/cache/conftool/dbconfig/20221103-124047-ladsgroup.json [12:40:56] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:41:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 508009 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:41:37] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [12:41:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3065 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493219 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:41:45] RECOVERY - 
HAProxy HTTPS wikipedia.org ECDSA on cp3065 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507977 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:41:49] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493214 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:41:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507969 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:42:01] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3062 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579602 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:42:32] (PS1) Slyngshede: P:idm Fix missing configuration dir. 
[puppet] - https://gerrit.wikimedia.org/r/852828 (https://phabricator.wikimedia.org/T320428) [12:43:05] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579537 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:43:11] (CR) Hashar: [C: +1] Provide current $PATH to the verify script (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko) [12:43:15] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 594288 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:43:17] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3059 is OK: SSL OK - OCSP staple validity for wikipedia.org has 594286 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:43:19] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3059 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579523 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:43:36] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [12:43:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3057 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507861 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:43:50] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [12:44:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [12:44:37] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3057 is OK: SSL OK - OCSP staple validity for wikipedia.org has 493046 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37948 and previous config saved to /var/cache/conftool/dbconfig/20221103-124454-ladsgroup.json [12:44:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:45:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:45:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T318605)', diff saved to https://phabricator.wikimedia.org/P37949 and previous config saved to /var/cache/conftool/dbconfig/20221103-124516-ladsgroup.json [12:45:52] (CR) Slyngshede: [C: +2] P:idm Fix missing configuration dir. 
[puppet] - https://gerrit.wikimedia.org/r/852828 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:45:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3062 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507729 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P37950 and previous config saved to /var/cache/conftool/dbconfig/20221103-124557-ladsgroup.json [12:45:59] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492963 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P37951 and previous config saved to /var/cache/conftool/dbconfig/20221103-124607-marostegui.json [12:46:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507714 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:46:33] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 579329 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:46:47] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P37952 and previous config saved to /var/cache/conftool/dbconfig/20221103-124646-ladsgroup.json [12:48:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507589 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:48:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3055 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492821 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:48:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507579 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:49:16] (PS1) Slyngshede: P:idm Apache2 modules are named with _ not - [puppet] - https://gerrit.wikimedia.org/r/852829 (https://phabricator.wikimedia.org/T320428) [12:49:25] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492758 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:50:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492680 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 
days) https://wikitech.wikimedia.org/wiki/HTTPS [12:50:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507438 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:50:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507438 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:51:31] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507391 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:51:36] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for haveged [puppet] - https://gerrit.wikimedia.org/r/852830 (https://phabricator.wikimedia.org/T135991) [12:52:00] (CR) Slyngshede: [C: +2] P:idm Apache2 modules are named with _ not - [puppet] - https://gerrit.wikimedia.org/r/852829 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:52:27] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492575 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:53:15] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 578927 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:55:29] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3050 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507153 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:55:29] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3053 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507153 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:55:33] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492389 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P37953 and previous config saved to /var/cache/conftool/dbconfig/20221103-125555-ladsgroup.json [12:56:23] (CR) JMeybohm: [C: -1] Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [12:56:25] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for jwt-authorizer on docker registry [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) [12:57:51] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3050 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492252 seconds left:Certificate *.wikipedia.org contains 
all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:57:51] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3053 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492252 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:57:51] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3060 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507012 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:57:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507009 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:57:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 507009 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 506896 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 506896 seconds left:Certificate 
*.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:48] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6001 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492135 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:48] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492135 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:48] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 578535 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:48] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 492135 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:48] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 578535 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 372 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:59:49] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 506895 seconds 
left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-17 23:59:59 +0000 (expires in 379 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1300). [13:00:05] Daimona, HouseOfM, cmelo, and arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Jclark-ctr) @fnegri i have not gotten confirmation that dbproxy1019 is depooled yet any update? 
[13:00:38] o/ [13:00:44] o/ [13:00:47] o/ [13:00:49] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for contint* to ServiceOps [puppet] - 10https://gerrit.wikimedia.org/r/852832 [13:00:52] o/ [13:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P37954 and previous config saved to /var/cache/conftool/dbconfig/20221103-130106-ladsgroup.json [13:01:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321123)', diff saved to https://phabricator.wikimedia.org/P37955 and previous config saved to /var/cache/conftool/dbconfig/20221103-130117-marostegui.json [13:01:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:01:21] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:01:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321123)', diff saved to https://phabricator.wikimedia.org/P37956 and previous config saved to /var/cache/conftool/dbconfig/20221103-130140-marostegui.json [13:01:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P37957 and previous config saved to /var/cache/conftool/dbconfig/20221103-130153-ladsgroup.json [13:02:28] Who's gonna be our brave deployer today? 
[13:02:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A [13:02:54] I can deploy, I guess [13:03:21] (03PS2) 10Btullis: Update the spark and spark-operator images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) [13:03:38] Cool, thanks Lucas. [13:03:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321123)', diff saved to https://phabricator.wikimedia.org/P37958 and previous config saved to /var/cache/conftool/dbconfig/20221103-130352-marostegui.json [13:04:36] (03PS3) 10Lucas Werkmeister (WMDE): Remove $wgCampaignEventsDatabaseName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851078 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:04:47] (03CR) 10Btullis: Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:05:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851078 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:05:19] cmelo, HouseOfM: should each of us test on a target wiki? [13:06:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A [13:06:11] Just tell me which one you'd like to pick. 
[13:06:15] (03Merged) 10jenkins-bot: Remove $wgCampaignEventsDatabaseName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851078 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:06:19] Yes, i think so Daimona [13:06:49] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:851078|Remove $wgCampaignEventsDatabaseName (T318592)]] [13:06:52] T318592: Deploy the CampaignEvents extension to production (testwiki, test2wiki, officewiki) - https://phabricator.wikimedia.org/T318592 [13:07:01] (03PS3) 10Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 [13:07:10] I'll take Test [13:07:15] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and daimona: Backport for [[gerrit:851078|Remove $wgCampaignEventsDatabaseName (T318592)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:07:32] alright, test away ^^ [13:07:51] This one should be a no-op [13:07:55] But let me take a quick look [13:07:59] ok, thanks [13:08:01] (03PS4) 10Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 [13:08:48] (03PS5) 10Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [13:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318605)', diff saved to https://phabricator.wikimedia.org/P37959 and previous config saved to /var/cache/conftool/dbconfig/20221103-130931-ladsgroup.json [13:09:35] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:09:42] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) 
[13:09:53] (03CR) 10CI reject: [V: 04-1] worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (owner: 10Jbond) [13:11:01] (03CR) 10CI reject: [V: 04-1] controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [13:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P37960 and previous config saved to /var/cache/conftool/dbconfig/20221103-131103-ladsgroup.json [13:11:09] Lucas_WMDE: everything seems in order [13:11:13] ok [13:11:20] cmelo: Are you going to test on test2wiki or officewiki? [13:11:32] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [13:12:00] testwiki [13:12:13] test2wiki* [13:12:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:12:24] Ok, I'll test on officewiki then [13:12:49] @Daimona not seeing anything yet [13:13:12] what are we expecting? [13:13:13] Yep, the final patch still has to be deployed [13:13:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:13:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:13:20] yes [13:13:21] me neither [13:13:29] ok [13:14:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:15:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar) Note the contint machines require a public IPv4 address in order to be able to reach out WMCS instances. Currently we have: | fqdn | IPv4 |--|-- | contint1001.wikimedia.org... 
[13:16:10] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1007.wikimedia.org [13:16:23] (03PS2) 10Lucas Werkmeister (WMDE): Enable the CampaignEvents extension on test(2)wiki and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851079 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:16:26] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1006.wikimedia.org [13:16:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851079 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:17:32] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on test(2)wiki and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851079 (https://phabricator.wikimedia.org/T318592) (owner: 10Daimona Eaytoy) [13:18:25] (03PS1) 10Majavah: sonofgridengine: drop support for .wmflabs domains [puppet] - 10https://gerrit.wikimedia.org/r/852836 [13:18:38] (03PS6) 10Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [13:18:59] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[13:19:35] Daimona, HouseOfM, cmelo: the change should be on mwdebug, please test [13:19:39] (03PS1) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [13:19:47] (did I miss a message or did scap not log this for some reason?) [13:19:57] Cool, thank you. Gonna take a few minutes to make sure everything's working correctly. [13:20:04] (And yes, I think it didn't !log) [13:20:11] (03PS1) 10Phuedx: Update Metrics Platform streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [13:20:40] !log lucaswerkmeister-wmde and daimona: Backport for [[gerrit:851079|Enable the CampaignEvents extension on test(2)wiki and officewiki (T318592)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet (on behalf of scap – log message got lost?) [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] T318592: Deploy the CampaignEvents extension to production (testwiki, test2wiki, officewiki) - https://phabricator.wikimedia.org/T318592 [13:20:45] (03CR) 10CI reject: [V: 04-1] controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [13:20:48] (03PS25) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [13:20:53] (copy+pasted from the terminal, where it did print the message) [13:21:29] (03CR) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [13:21:32] @Daimona still not seeing anything on testwiki? 
[13:21:35] (03PS9) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [13:21:52] probably just me not knowing what I am doing [13:22:32] Make sure XWikimediaDebug is on [13:22:32] HouseOfM: are you using the WikimediaDebug extension? [13:22:48] I also can't see any change on test2wiki [13:23:11] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add golang 1.19 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852776 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [13:23:44] @Lucas_WMDE yeah, but it says testwiki is not a supported domain [13:24:15] o_O [13:24:23] (03PS1) 10David Caro: cloud: add alias for puppet_ca to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/852839 [13:24:35] it works for me in Firefox (I turned it on, reloaded, and saw the response header server: mwdebug1001.eqiad.wmnet in the network panel) [13:25:05] hmm, chrome doesn't like it at all [13:25:39] works for me in Chromium too [13:25:40] I can see the special pages on test2wiki, start testing it now [13:26:21] 10SRE: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122 (10fgiunchedi) 05Open→03Declined I'm going to be bold and decline the task as we don't have any plans to tackle this [13:27:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Jclark-ctr sorry for the late update, I depooled it about 2 hours ago (11:30 UTC), by editing Hiera in Horizon and pointing everything t...
[13:28:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:851078|Remove $wgCampaignEventsDatabaseName (T318592)]] (duration: 08m 44s) [13:28:13] T318592: Deploy the CampaignEvents extension to production (testwiki, test2wiki, officewiki) - https://phabricator.wikimedia.org/T318592 [13:28:14] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: labstore1007.wikimedia.org [13:28:20] hmmm [13:28:24] that’s severely delayed [13:28:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: labstore1007.wikimedia.org [13:28:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318955)', diff saved to https://phabricator.wikimedia.org/P37961 and previous config saved to /var/cache/conftool/dbconfig/20221103-131614-ladsgroup.json [13:28:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [13:28:28] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: labstore1006.wikimedia.org [13:28:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: labstore1006.wikimedia.org [13:28:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [13:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T318955)', diff saved to https://phabricator.wikimedia.org/P37962 and previous config saved to /var/cache/conftool/dbconfig/20221103-131638-ladsgroup.json [13:28:31] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:28:56] I didn’t realize that message had still been missing [13:29:36] that “finished scap” of :28 was actually finished at :15 [13:30:51] (03CR) 10Ssingh: [C: 03+2] team-traffic: drop VarnishTrafficDrop and HAProxyEdgeTrafficDrop [alerts] - 
10https://gerrit.wikimedia.org/r/852206 (https://phabricator.wikimedia.org/T322220) (owner: 10Ssingh) [13:31:19] test2wiki sounds good! [13:32:24] (03CR) 10David Caro: [V: 03+1 C: 03+2] webservice: add toolforge-* link for it [puppet] - 10https://gerrit.wikimedia.org/r/851685 (owner: 10David Caro) [13:32:29] testwiki is good [13:32:53] what about officewiki? [13:33:00] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [13:33:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s start the gate-and-submit build already" [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [13:33:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Media border option applies to the media element, not the wrapper [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [13:33:55] @Daimona was on that one, how's it looking? [13:34:35] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/798394 (owner: 10PipelineBot) [13:34:40] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/801445 (owner: 10PipelineBot) [13:34:43] Yep, just finished testing, looks good AFAICS. I tested everything that came to mind, checked logstash and the DB. Everything's looking good. [13:34:44] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/801745 (owner: 10PipelineBot) [13:34:47] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/803305 (owner: 10PipelineBot) [13:34:49] ok, thanks! 
[13:34:50] (03PS2) 10David Caro: cloud: add alias for puppet_ca to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/852839 [13:34:54] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/808979 (owner: 10PipelineBot) [13:34:54] syncing [13:34:57] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809128 (owner: 10PipelineBot) [13:35:00] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/811243 (owner: 10PipelineBot) [13:35:05] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809127 (owner: 10PipelineBot) [13:35:07] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842394 (owner: 10PipelineBot) [13:35:11] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842398 (owner: 10PipelineBot) [13:35:14] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/843535 (owner: 10PipelineBot) [13:35:18] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844009 (owner: 10PipelineBot) [13:35:22] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844010 (owner: 10PipelineBot) [13:35:27] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844024 (owner: 10PipelineBot) [13:35:31] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844020 (owner: 10PipelineBot) [13:35:34] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844011 
(owner: 10PipelineBot) [13:35:39] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/843530 (owner: 10PipelineBot) [13:37:13] (03CR) 10Vgutierrez: [C: 03+1] "NOOP for deployment-puppetmaster04: https://puppet-compiler.wmflabs.org/pcc-worker1001/37940/" [puppet] - 10https://gerrit.wikimedia.org/r/852839 (owner: 10David Caro) [13:39:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) It looks like they're all old connections that are probably stuck or unused, and will never terminate. There's no open connection that wa... [13:40:10] the scap finished (presumably the log message will come through eventually) [13:40:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [13:40:38] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:40:40] arlolra: you’re still here, right? :) [13:40:46] right [13:40:51] yay [13:41:06] Lucas_WMDE: so we're done, right? 
[13:41:07] zuul predicts 7 more minutes for the gate-and-submit [13:41:09] Daimona: should be, yeah [13:41:17] Amazing, thank you :) [13:41:25] (03PS1) 10Elukey: istio: upgrade to 1.15.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852842 (https://phabricator.wikimedia.org/T322193) [13:42:29] (03CR) 10LSobanski: Set profile::contacts::role_contacts for contint* to ServiceOps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [13:42:32] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37941/console" [puppet] - 10https://gerrit.wikimedia.org/r/852839 (owner: 10David Caro) [13:43:01] (03CR) 10Elukey: "Built the images with docker-pkg, but haven't really find a good way to test them. I will try to boot minikube, but I'll probably need ist" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852842 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [13:43:33] anyone around who can look into logmsgbot? it seems to be delaying (maybe dropping?) 
messages [13:43:42] (03PS2) 10Ssingh: Release 2.0.0-3 [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) [13:43:56] (03PS1) 10Samtar: [prod noop] CommonSettings-labs: Fix beta cluster cxserver domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852843 (https://phabricator.wikimedia.org/T322323) [13:44:03] wikitech sounds like it runs on alert1001/alert2001 and I can’t seem to SSH into those hosts [13:45:17] Thank you Lucas [13:45:27] Thanks @Lucas_WMDE [13:45:58] (03CR) 10Ssingh: "(Removed Python 2 from debian/rules as well, which we missed.)" [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:46:27] logmsgbot: help [13:46:33] meh, it was worth a shot [13:46:35] (03PS1) 10Ssingh: Release 0.4.6-3 [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/852844 (https://phabricator.wikimedia.org/T321309) [13:47:34] lmao, searching the sal tool for logmsgbot produces an error [13:47:53] I broked it 🎉 [13:49:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321123)', diff saved to https://phabricator.wikimedia.org/P37976 and previous config saved to /var/cache/conftool/dbconfig/20221103-134920-marostegui.json [13:49:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:49:24] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:49:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:49:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P37977 and previous config saved to /var/cache/conftool/dbconfig/20221103-134935-ladsgroup.json [13:49:37] 
(03Merged) 10jenkins-bot: Media border option applies to the media element, not the wrapper [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [13:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321123)', diff saved to https://phabricator.wikimedia.org/P37978 and previous config saved to /var/cache/conftool/dbconfig/20221103-134943-marostegui.json [13:50:02] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:852314|Media border option applies to the media element, not the wrapper (T318300)]] [13:50:05] T318300: Media alignment broken with MinervaNeue since disabling wgParserEnableLegacyMediaDOM - https://phabricator.wikimedia.org/T318300 [13:50:26] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:852314|Media border option applies to the media element, not the wrapper (T318300)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:50:56] (03CR) 10David Caro: [C: 03+2] cloud: add alias for puppet_ca to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/852839 (owner: 10David Caro) [13:51:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P37979 and previous config saved to /var/cache/conftool/dbconfig/20221103-135113-ladsgroup.json [13:51:16] !log Finished scap: Backport for [[gerrit:851079|Enable the CampaignEvents extension on test(2)wiki and officewiki (T318592)]] (duration: 20m 46s) (originally at 13:38:40 UTC; logmsgbot dropped the message) [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:19] T318592: Deploy the CampaignEvents extension to production (testwiki, test2wiki, officewiki) - https://phabricator.wikimedia.org/T318592 [13:51:26] arlolra: please test :) 
[13:51:36] testing [13:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321123)', diff saved to https://phabricator.wikimedia.org/P37980 and previous config saved to /var/cache/conftool/dbconfig/20221103-135155-marostegui.json [13:52:06] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on dbproxy1019.eqiad.wmnet with reason: T313445 [13:52:08] T313445: hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 [13:52:19] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dbproxy1019.eqiad.wmnet with reason: T313445 [13:52:28] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (owner: 10Jbond) [13:52:30] Lucas_WMDE: lgtm, thanks [13:52:33] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [13:52:36] ok, thanks [13:52:40] syncing [13:52:51] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [13:53:38] (03PS1) 10Ssingh: Release 0.1-2 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/852886 (https://phabricator.wikimedia.org/T321309) [13:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318605)', diff saved to https://phabricator.wikimedia.org/P37981 and previous config saved to /var/cache/conftool/dbconfig/20221103-135454-ladsgroup.json [13:54:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [13:54:58] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:55:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [13:55:12] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:55:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T318605)', diff saved to https://phabricator.wikimedia.org/P37982 and previous config saved to /var/cache/conftool/dbconfig/20221103-135522-ladsgroup.json [13:55:47] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: ensure we create all directories [cookbooks] - 10https://gerrit.wikimedia.org/r/852782 (owner: 10Jbond) [13:55:52] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: ensure we create all directories [cookbooks] - 10https://gerrit.wikimedia.org/r/852782 [13:56:34] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:852314|Media border option applies to the media element, not the wrapper (T318300)]] (duration: 06m 31s) [13:56:37] T318300: Media alignment broken with MinervaNeue since disabling wgParserEnableLegacyMediaDOM - https://phabricator.wikimedia.org/T318300 [13:56:41] ok, now it logged properly again [13:56:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:56:58] !log UTC afternoon backport+config window done [13:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] (03PS1) 10Ssingh: Release 0.3 [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/852888 (https://phabricator.wikimedia.org/T321309) [13:57:31] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) I have checked with traffic, and we can effectively start by removing the trafficserver mapping via https://gerrit.wikimedia.org/r/... 
[13:57:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:57:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:58:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:58:33] Lucas_WMDE: thanks! [13:59:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10Ottomata) Approved [13:59:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10Ottomata) Approved [13:59:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Ottomata) Approved [13:59:54] 10SRE, 10Infrastructure-Foundations: IDM: Central logging on all changes - https://phabricator.wikimedia.org/T320431 (10SLyngshede-WMF) 05Open→03Resolved p:05Triage→03Low a:03SLyngshede-WMF [13:59:56] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [14:01:46] (03PS13) 10Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [14:01:58] (03PS33) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [14:02:03] (03PS26) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [14:02:08] (03PS10) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [14:02:36] (03CR) 10Ssingh: [C: 03+2] Release 9.1.3-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/849646 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:02:38] (03PS1) 10Slyngshede: C:idm::deployment add required packages for testing. [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) [14:02:44] anyone mind verrry quickly giving https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/852843 a once-over? beta cluster only, noop for prod, changing `https://cxserver-beta.wmflabs.org` to `https://cxserver-beta.wmcloud.org`, for T322323 [14:02:44] T322323: Special:ContentTranslation unable to connect to cxserver on Beta Cluster - https://phabricator.wikimedia.org/T322323 [14:03:12] (03CR) 10CI reject: [V: 04-1] C:idm::deployment add required packages for testing. [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [14:03:36] (else I'll self+2 and test it on beta) [14:04:20] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s [14:04:20] Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s7_3317: Servers dbproxy1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:04:31] (03Abandoned) 10Ssingh: aptrepo: add trafficserver9 to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/849640 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:04:33] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Ottomata) Approved. [14:04:47] (03PS2) 10Slyngshede: C:idm::deployment add required packages for testing. [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) [14:04:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318955)', diff saved to https://phabricator.wikimedia.org/P37983 and previous config saved to /var/cache/conftool/dbconfig/20221103-140447-ladsgroup.json [14:04:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:04:50] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s [14:04:50] Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s7_3317: Servers dbproxy1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:04:51] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:05:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:05:04] (03CR) 10Jelto: [C: 04-1] "I left one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:05:10] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T318955)', diff saved to https://phabricator.wikimedia.org/P37984 and previous config saved to /var/cache/conftool/dbconfig/20221103-140509-ladsgroup.json [14:05:30] (03CR) 10CI reject: [V: 04-1] C:idm::deployment add required packages for testing. [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [14:05:32] (03CR) 10Ssingh: [C: 03+2] Release 0.6.3 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318605)', diff saved to https://phabricator.wikimedia.org/P37985 and previous config saved to /var/cache/conftool/dbconfig/20221103-140541-ladsgroup.json [14:05:42] PROBLEM - Host dbproxy1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:05:44] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:05:49] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Vgutierrez) [14:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318955)', diff saved to https://phabricator.wikimedia.org/P37986 and previous config saved to /var/cache/conftool/dbconfig/20221103-140621-ladsgroup.json [14:06:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [14:06:31] (03CR) 10Jbond: [C: 03+1] paws: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842757 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:06:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db2146.codfw.wmnet with reason: Maintenance [14:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T318955)', diff saved to https://phabricator.wikimedia.org/P37987 and previous config saved to /var/cache/conftool/dbconfig/20221103-140643-ladsgroup.json [14:07:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P37988 and previous config saved to /var/cache/conftool/dbconfig/20221103-140703-marostegui.json [14:07:11] (03CR) 10Jbond: [C: 03+1] "lgtmn" [puppet] - 10https://gerrit.wikimedia.org/r/852830 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:07:53] (03CR) 10Samtar: [C: 03+2] "beta cluster only, self-deploy, production noop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852843 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [14:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318955)', diff saved to https://phabricator.wikimedia.org/P37989 and previous config saved to /var/cache/conftool/dbconfig/20221103-140753-ladsgroup.json [14:08:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] [prod noop] CommonSettings-labs: Fix beta cluster cxserver domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852843 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [14:08:43] (03Merged) 10jenkins-bot: [prod noop] CommonSettings-labs: Fix beta cluster cxserver domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852843 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [14:09:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852843 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [14:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318955)', diff saved to https://phabricator.wikimedia.org/P37990 
and previous config saved to /var/cache/conftool/dbconfig/20221103-140926-ladsgroup.json [14:09:58] (03PS2) 10Muehlenhoff: Set profile::contacts::role_contacts for contint* to ServiceOps-Collab [puppet] - 10https://gerrit.wikimedia.org/r/852832 [14:10:00] (ah I do wish `scap backport` wouldn't re-comment for already +2'd changes) [14:10:11] s/+2'd/merged [14:10:49] (03CR) 10Muehlenhoff: Set profile::contacts::role_contacts for contint* to ServiceOps-Collab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [14:11:52] !log Sunsetting search.wikimedia.org, starting a 2 week grace period before decommission - T316296 [14:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:55] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [14:11:55] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: remove search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [14:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:14:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:14:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:14:37] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) Starting 2 week grace period from today, full decom to happen after 2022-11-17 [14:15:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:16:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:17:25] Lucas_WMDE: thanks for the +1 ^^ (still getting an sorta-expected `The page's settings blocked the loading of a resource at https://cxserver-beta.wmcloud.org/v2/list/languagepairs 
("default-src").` though... any ideas?) [14:17:54] RECOVERY - Host dbproxy1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.74 ms [14:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318605)', diff saved to https://phabricator.wikimedia.org/P37991 and previous config saved to /var/cache/conftool/dbconfig/20221103-142044-ladsgroup.json [14:20:48] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:20:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P37992 and previous config saved to /var/cache/conftool/dbconfig/20221103-142055-ladsgroup.json [14:21:28] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) 05Open→03In progress [14:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P37993 and previous config saved to /var/cache/conftool/dbconfig/20221103-142215-marostegui.json [14:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P37994 and previous config saved to /var/cache/conftool/dbconfig/20221103-142301-ladsgroup.json [14:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P37995 and previous config saved to /var/cache/conftool/dbconfig/20221103-142434-ladsgroup.json [14:26:16] TheresNoTime: no idea, sorry [14:26:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [14:26:34] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:26:35] * TheresNoTime cry. 
[14:26:45] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [14:27:08] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [14:29:32] didn't you say just yesterday that you weren't talking about cxserver on deployment-prep :P [14:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:37] where is that CSP entry even added? I'm not finding it from mw-config.git or ext/ContentTranslation.git [14:32:56] (03PS1) 10Ssingh: package_builder: add hook for varnish6 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) [14:33:43] (03CR) 10CI reject: [V: 04-1] package_builder: add hook for varnish6 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:33:54] TheresNoTime: are you sure it ever worked before? [14:34:19] taavi: no idea if it worked before! guessing/hoping it did [14:35:27] TheresNoTime: I suspect https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#25987 makes it work for production, but it doesn't work for beta due to the different naming scheme [14:35:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P37996 and previous config saved to /var/cache/conftool/dbconfig/20221103-143552-ladsgroup.json [14:35:59] gah!
D: I was hoping this would be Quick And Easy(tm) [14:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P37997 and previous config saved to /var/cache/conftool/dbconfig/20221103-143603-ladsgroup.json [14:36:08] (03PS2) 10Ssingh: package_builder: add hook for varnish6 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) [14:36:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:30] need a link to the stewardship request task or do you have it somewhere already? [14:36:39] hahah! [14:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321123)', diff saved to https://phabricator.wikimedia.org/P37998 and previous config saved to /var/cache/conftool/dbconfig/20221103-143722-marostegui.json [14:37:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:37:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:37:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:37:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T321123)', diff saved to https://phabricator.wikimedia.org/P37999 and previous config saved to /var/cache/conftool/dbconfig/20221103-143745-marostegui.json [14:38:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P38000 and previous config saved to /var/cache/conftool/dbconfig/20221103-143809-ladsgroup.json [14:38:18] !log fnegri@cumin1001 conftool action : set/pooled=no; 
selector: service=wikireplicas-b,name=dbproxy1019 [14:39:38] I guess that lvs complaining about wikireplicas is you Amir1? [14:39:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P38001 and previous config saved to /var/cache/conftool/dbconfig/20221103-143943-ladsgroup.json [14:40:09] vgutierrez: it could be, which section? [14:40:31] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1018&service=PyBal+backends+health+check [14:40:39] (03PS1) 10DCausse: team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) [14:40:51] s2, s3 1, s4 and s7 apparently [14:40:56] *s1 [14:41:27] !log fnegri@cumin1001 conftool action : set/pooled=no; selector: service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [14:41:47] vgutierrez: orch says they are working fine https://orchestrator.wikimedia.org/web/cluster/alias/s8 [14:41:51] 10SRE, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10awight) [14:42:10] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [14:42:12] 10SRE, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10awight) Should be upgraded to reflect tilerator's demise. Maybe geoshapes should be a separate service, maybe not? [14:42:24] I think haproxy needs a restart, I remember it happening from time to time [14:42:50] Amir1: so those wikireplicas services just check dbproxy1018 [14:42:52] see https://config-master.wikimedia.org/pybal/eqiad/wikireplicas-a-s4 [14:42:58] (03PS3) 10Slyngshede: C:idm::deployment add required packages for testing. 
[puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) [14:43:02] (03CR) 10CI reject: [V: 04-1] team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) (owner: 10DCausse) [14:43:07] let me restart it [14:43:25] did it, let's see [14:43:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) The first attempt at depooling didn't work because I didn't specify a full hostname. This one worked: ` $ sudo confctl select "service=w... [14:44:36] hmmm I'm afraid it's related to fnegri work on dbproxy1019 [14:44:57] 10SRE, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10awight) [14:45:04] dhinus: ^ [14:45:19] !log fnegri@cumin1001 conftool action : set/pooled=no; selector: name=dbproxy1019.eqiad.wmnet [14:45:35] yeah.... those services only have 1 backend server.. 
so of course it's gonna scream :) [14:45:40] Amir1: yes on it, sorry [14:46:16] ^_^ [14:46:21] (03PS1) 10Volans: constants: use CORE_DATACENTERS from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/852902 [14:46:23] (03PS1) 10Volans: ipmi: clarify that the target can also be an IP [software/spicerack] - 10https://gerrit.wikimedia.org/r/852903 (https://phabricator.wikimedia.org/T320721) [14:46:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321123)', diff saved to https://phabricator.wikimedia.org/P38002 and previous config saved to /var/cache/conftool/dbconfig/20221103-144658-marostegui.json [14:47:02] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:48:12] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.243:3318, 208.80.154.243:3315, 208.80.154.243:3314, 208.80.154.243:3317, 208.80.154.243:3316, 208.80.154.243:3311, 208.80.154.243:3313, 208.80.154.243:3312]) https://wikitech.wikimedia.org/wiki/PyBal [14:48:23] (03PS2) 10DCausse: team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) [14:49:10] hmm looks like my depooling wasn't entirely right :P [14:49:17] dhinus: you can't depool it actually [14:49:28] https://config-master.wikimedia.org/pybal/eqiad/wikireplicas-b-s1 [14:49:33] take that one as an example [14:50:09] pybal enforces that a service can't be left without any backend servers pooled [14:50:33] sorry, very much an LVS noob...
I was following a wiki page that made me hope there would be no alerts :P [14:50:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Papaul) When the server was in row c it was in vlan-id 1119 cloud-support1-c-eqiad in row B where the server is now, we have no cloud-support1-... [14:50:42] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.243:3318, 208.80.154.243:3315, 208.80.154.243:3314, 208.80.154.243:3317, 208.80.154.243:3316, 208.80.154.243:3311, 208.80.154.243:3313, 208.80.154.243:3312]) https://wikitech.wikimedia.org/wiki/PyBal [14:50:50] (03CR) 10CI reject: [V: 04-1] team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) (owner: 10DCausse) [14:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P38003 and previous config saved to /var/cache/conftool/dbconfig/20221103-145101-ladsgroup.json [14:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T318605)', diff saved to https://phabricator.wikimedia.org/P38004 and previous config saved to /var/cache/conftool/dbconfig/20221103-145110-ladsgroup.json [14:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:51:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:51:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38005 and 
previous config saved to /var/cache/conftool/dbconfig/20221103-145133-ladsgroup.json [14:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318955)', diff saved to https://phabricator.wikimedia.org/P38006 and previous config saved to /var/cache/conftool/dbconfig/20221103-145316-ladsgroup.json [14:53:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:53:20] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:53:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:53:32] if I pool it on dbproxy1018 (which is up and "pooled=inactive") would it fix the alert? [14:53:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T318955)', diff saved to https://phabricator.wikimedia.org/P38007 and previous config saved to /var/cache/conftool/dbconfig/20221103-145339-ladsgroup.json [14:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318955)', diff saved to https://phabricator.wikimedia.org/P38008 and previous config saved to /var/cache/conftool/dbconfig/20221103-145453-ladsgroup.json [14:54:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:55:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T318955)', diff saved to https://phabricator.wikimedia.org/P38009 and previous config saved to /var/cache/conftool/dbconfig/20221103-145516-ladsgroup.json [14:56:13] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Appledora - https://phabricator.wikimedia.org/T322222 (10jbond) 05Open→03Resolved 
a:03jbond @Appledora i have now added you to the wmf ldap group all other permissions where already in place. please let me know if there are any other issues [14:56:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318955)', diff saved to https://phabricator.wikimedia.org/P38010 and previous config saved to /var/cache/conftool/dbconfig/20221103-145620-ladsgroup.json [14:56:37] !log fnegri@cumin1001 conftool action : set/pooled=yes; selector: name=dbproxy1018.eqiad.wmnet [14:56:40] (03PS3) 10DCausse: team-search-platform: alert when CirrusSearch jobs are backlogged [alerts] - 10https://gerrit.wikimedia.org/r/852899 (https://phabricator.wikimedia.org/T312175) [14:56:40] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:58:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318955)', diff saved to https://phabricator.wikimedia.org/P38011 and previous config saved to /var/cache/conftool/dbconfig/20221103-145759-ladsgroup.json [14:58:14] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (10jbond) 05Open→03Resolved a:03jbond @KMorgan-WMF i have added you to the wmf ldap group please let me know if you are still having issues [14:59:15] (03PS1) 10Hashar: extdist: remove integration/composer.git [puppet] - 10https://gerrit.wikimedia.org/r/852906 (https://phabricator.wikimedia.org/T293055) [14:59:27] (03CR) 10Muehlenhoff: C:idm::deployment add required packages for testing. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:00:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:12] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:00:14] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Ilooremeta - https://phabricator.wikimedia.org/T321918 (10jbond) 05Open→03Resolved a:03jbond @ILooremeta-WMF i have added you to the wmf ldap group please let me know if you are still having issues [15:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P38012 and previous config saved to /var/cache/conftool/dbconfig/20221103-150208-marostegui.json [15:02:39] (03PS4) 10Slyngshede: C:idm::deployment add required packages for testing. [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) [15:02:47] (03CR) 10Slyngshede: C:idm::deployment add required packages for testing. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:03:11] (03PS1) 10Muehlenhoff: Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 [15:03:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Papaul it wasn't mentioned in the task, checking. 
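Note on the PyBal alerts earlier in this log: the "marked down but pooled" state is consistent with the rule described in the discussion, namely that PyBal will not leave a service without any pooled backend. What follows is a minimal Python sketch of such a guard under stated assumptions, not PyBal's actual code. The names `Backend` and `can_depool`, and the `depool_threshold` default of 0.5, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    pooled: bool  # desired state from configuration
    up: bool      # result of the latest health check


def can_depool(backends, candidate, depool_threshold=0.5):
    """Return True if taking `candidate` out still leaves enough backends serving.

    Assumption (hypothetical rule): at least one backend, and at least
    `depool_threshold` of all configured backends, must stay pooled and healthy.
    """
    # Backends that would still be serving traffic once the candidate is removed.
    remaining = [
        b for b in backends
        if b.name != candidate and b.pooled and b.up
    ]
    # Never allow the serving set to become empty.
    required = max(1, math.ceil(len(backends) * depool_threshold))
    return len(remaining) >= required


# A service with a single backend: the guard refuses, so the server stays
# pooled even though its health check fails.
single = [Backend("dbproxy1019.eqiad.wmnet", pooled=True, up=False)]
single_ok = can_depool(single, "dbproxy1019.eqiad.wmnet")

# A service with two backends: the failing one can be removed safely.
pair = [Backend("a", pooled=True, up=True), Backend("b", pooled=True, up=False)]
pair_ok = can_depool(pair, "b")
```

Under this sketch the single-backend case is refused, which matches the observation in the channel that services with only one backend server will alert when that server goes down.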
[15:03:48] (03CR) 10CI reject: [V: 04-1] Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:05:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:06:07] (03PS1) 10JMeybohm: Move kube-scheduler config to file [puppet] - 10https://gerrit.wikimedia.org/r/852908 (https://phabricator.wikimedia.org/T300499) [15:06:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318605)', diff saved to https://phabricator.wikimedia.org/P38013 and previous config saved to /var/cache/conftool/dbconfig/20221103-150610-ladsgroup.json [15:06:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:06:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:06:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:06:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38014 and previous config saved to /var/cache/conftool/dbconfig/20221103-150633-ladsgroup.json [15:08:04] (03PS1) 10Ottomata: Set eventgate service for rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852909 (https://phabricator.wikimedia.org/T307959) [15:08:15] (03PS2) 10Muehlenhoff: Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 [15:08:54] (03PS2) 10Ottomata: 
Set eventgate service for rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852909 (https://phabricator.wikimedia.org/T307959) [15:09:08] (03CR) 10CI reject: [V: 04-1] Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:09:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:10:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/852886 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:10:43] (03CR) 10Ottomata: [C: 03+2] Set eventgate service for rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852909 (https://phabricator.wikimedia.org/T307959) (owner: 10Ottomata) [15:11:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P38015 and previous config saved to /var/cache/conftool/dbconfig/20221103-151129-ladsgroup.json [15:11:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/852844 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:11:50] (03Merged) 10jenkins-bot: Set eventgate service for rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852909 (https://phabricator.wikimedia.org/T307959) (owner: 10Ottomata) [15:12:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/852888 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:12:14] (03CR) 10Vgutierrez: [C: 03+1] Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:12:38] (03PS1) 10Jbond: 
admin: add various ldap accounts [puppet] - 10https://gerrit.wikimedia.org/r/852910 (https://phabricator.wikimedia.org/T321902) [15:12:47] 10SRE, 10Infrastructure-Foundations: IDM: Central logging on all changes - https://phabricator.wikimedia.org/T320431 (10MoritzMuehlenhoff) Let's keep this open until we also have logrotate configs? [15:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P38016 and previous config saved to /var/cache/conftool/dbconfig/20221103-151307-ladsgroup.json [15:13:20] (03CR) 10CI reject: [V: 04-1] admin: add various ldap accounts [puppet] - 10https://gerrit.wikimedia.org/r/852910 (https://phabricator.wikimedia.org/T321902) (owner: 10Jbond) [15:13:57] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/nda for Hibashaath - https://phabricator.wikimedia.org/T321902 (10jbond) @HShaath-WMF i have added you to the wmf ldap group which should provide the access you need please let me know if there is still an issue [15:14:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/nda for Hibashaath - https://phabricator.wikimedia.org/T321902 (10jbond) 05Open→03Resolved a:03jbond [15:14:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:15:30] (03PS2) 10Jbond: admin: add various ldap accounts [puppet] - 10https://gerrit.wikimedia.org/r/852910 (https://phabricator.wikimedia.org/T321902) [15:15:39] (03CR) 10Btullis: [C: 03+1] "Many thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:16:16] (03CR) 10Ssingh: [C: 03+2] Release 0.4.6-3 [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/852844 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:16:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:16:27] (03CR) 10Vgutierrez: [C: 03+1] package_builder: add hook for varnish6 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:16:56] (03CR) 10Ssingh: [C: 03+2] Release 0.3 [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/852888 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:12] (03CR) 10Vgutierrez: [C: 03+1] Release 2.0.0-3 [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:17:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:17:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P38017 and previous config saved to /var/cache/conftool/dbconfig/20221103-151716-marostegui.json [15:17:27] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set destination_event_service for rc0.mediawiki.page_content_change to fix canary producer job (duration: 03m 36s) [15:17:41] (03CR) 10Ssingh: [C: 03+2] Release 2.0.0-3 [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:45] (03CR) 10Vgutierrez: [C: 03+1] Release 0.1-2 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/852886 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:51] !log comment out www-data crontab on cloudmetrics100{1,2} T297712 [15:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:54] T297712: Migrate cloudmetrics workload from cloudmetrics100[1-2] to cloudmetrics100[3-4] - https://phabricator.wikimedia.org/T297712 [15:18:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:18:14] (03CR) 10Muehlenhoff: [C: 04-1] "Actually, this won't be enough: Since Bullseye has Varnish 6.5, this would install the Bullseye version. So we need some additional pinnin" [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:18:48] (03CR) 10Ssingh: [C: 03+2] package_builder: add hook for varnish6 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:18:56] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/nda for Hghani - https://phabricator.wikimedia.org/T321910 (10jbond) 05Open→03Resolved a:03jbond @Hghani i have added you to the wmf ldap group let me know if there are any issue [15:19:14] (03PS1) 10Hnowlan: poolcounter: Await connect coroutine [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852911 (https://phabricator.wikimedia.org/T233196) [15:19:40] (03CR) 10Jbond: [C: 03+2] admin: add various ldap accounts [puppet] - 10https://gerrit.wikimedia.org/r/852910 (https://phabricator.wikimedia.org/T321902) (owner: 10Jbond) [15:19:44] (03CR) 10Ssingh: [C: 03+2] Release 0.1-2 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/852886 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:20:14] oh wow [15:20:16] sorry moritzm [15:20:22] I missed your -1 [15:20:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/852897 reverting [15:20:43] (03PS1) 10Vgutierrez: swift: Ramp up ms-be07 
balance [puppet] - 10https://gerrit.wikimedia.org/r/852912 (https://phabricator.wikimedia.org/T322231) [15:20:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37942/console" [puppet] - 10https://gerrit.wikimedia.org/r/852908 (https://phabricator.wikimedia.org/T300499) (owner: 10JMeybohm) [15:21:05] sukhe: no need to revert [15:21:12] we can just amend the existing hook [15:21:21] ok yeah, this isn't critical yet [15:21:51] (03CR) 10MVernon: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/852912 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [15:21:57] (03CR) 10Vgutierrez: [C: 03+1] varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [15:22:02] moritzm: but why would this need additional pinning though? doesn't apt have a higher priority anyway? [15:22:11] that's what we are doing for the current install in buster [15:22:14] (03CR) 10Vgutierrez: [C: 03+2] swift: Ramp up ms-be07 balance [puppet] - 10https://gerrit.wikimedia.org/r/852912 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [15:25:24] (03CR) 10Vlad.shapik: [C: 03+1] "Good catch. Thank you, Hugh." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852911 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:25:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10jbond) p:05Triage→03Medium [15:26:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10jbond) @ILooremeta-WMF are you able to sign the L3 agreement @CMacholan are you able to approve this request? 
[15:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P38018 and previous config saved to /var/cache/conftool/dbconfig/20221103-152638-ladsgroup.json [15:26:47] (03CR) 10Muehlenhoff: [C: 04-1] package_builder: add hook for varnish6 (bullseye) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:27:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10jbond) [15:27:36] (03CR) 10Hnowlan: [C: 03+2] poolcounter: Await connect coroutine [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852911 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:28:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P38019 and previous config saved to /var/cache/conftool/dbconfig/20221103-152817-ladsgroup.json [15:28:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) [15:29:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) @Hghani can you please sign the L3 document @CMacholan are you able to approve this request [15:29:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) p:05Triage→03Medium [15:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:04] (03CR) 10Klausman: [C: 03+1] istio: upgrade to 1.15.3 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852842 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [15:31:49] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:32:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) I discussed this with @aborrero and we think "VLAN private1-b-eqiad (1018)" will work. [15:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321123)', diff saved to https://phabricator.wikimedia.org/P38020 and previous config saved to /var/cache/conftool/dbconfig/20221103-153224-marostegui.json [15:32:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:32:28] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:32:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:32:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:32:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:33:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:33:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:33:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance [15:33:56] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance [15:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T321123)', diff saved to https://phabricator.wikimedia.org/P38021 and previous config saved to /var/cache/conftool/dbconfig/20221103-153404-marostegui.json [15:34:28] (03PS1) 10Jbond: admin: add Hibashaath to analytics-private [puppet] - 10https://gerrit.wikimedia.org/r/852914 (https://phabricator.wikimedia.org/T322146) [15:35:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:35:24] (03CR) 10CI reject: [V: 04-1] admin: add Hibashaath to analytics-private [puppet] - 10https://gerrit.wikimedia.org/r/852914 (https://phabricator.wikimedia.org/T322146) (owner: 10Jbond) [15:35:54] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:35:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38022 and previous config saved to /var/cache/conftool/dbconfig/20221103-153553-ladsgroup.json [15:35:57] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:36:14] 10SRE, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Epic, and 3 others: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10awight) [15:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321123)', diff saved to https://phabricator.wikimedia.org/P38023 and previous config saved to /var/cache/conftool/dbconfig/20221103-153628-marostegui.json [15:37:04] PROBLEM - Check 
systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:13] (03PS2) 10Jbond: admin: add Hibashaath to analytics-private [puppet] - 10https://gerrit.wikimedia.org/r/852914 (https://phabricator.wikimedia.org/T322146) [15:37:32] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [15:38:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/852902 (owner: 10Volans) [15:38:30] !log fnegri@cumin1001 conftool action : set/pooled=inactive; selector: name=dbproxy1019.eqiad.wmnet [15:38:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/852903 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [15:39:32] (03Merged) 10jenkins-bot: poolcounter: Await connect coroutine [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852911 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:40:30] (03PS3) 10Jbond: Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:41:02] (03PS4) 10Jbond: Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:41:06] (03CR) 10CI reject: [V: 04-1] Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:41:24] PROBLEM - Disk space on conf1007 is CRITICAL: DISK CRITICAL - free space: / 2620 MB (3% inode=98%): /tmp 2620 MB (3% inode=98%): /var/tmp 2620 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1007&var-datasource=eqiad+prometheus/ops [15:41:37] (03CR) 10CI reject: [V: 04-1] Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:41:39] (03CR) 10Jbond: [C: 03+1] "LGTm i fixed the CI issue and also added sre-admins" [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318955)', diff saved to https://phabricator.wikimedia.org/P38024 and previous config saved to /var/cache/conftool/dbconfig/20221103-154145-ladsgroup.json [15:41:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:41:49] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:42:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T318955)', diff saved to https://phabricator.wikimedia.org/P38025 and previous config saved to /var/cache/conftool/dbconfig/20221103-154209-ladsgroup.json [15:42:24] (03PS5) 10Jbond: Extend validate_common_ops_group check for fr-tech-admins [puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:42:30] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318955)', diff saved to https://phabricator.wikimedia.org/P38026 and previous config saved to /var/cache/conftool/dbconfig/20221103-154325-ladsgroup.json [15:43:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
8:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:43:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38027 and previous config saved to /var/cache/conftool/dbconfig/20221103-154339-ladsgroup.json [15:43:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:43:43] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:43:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38028 and previous config saved to /var/cache/conftool/dbconfig/20221103-154347-ladsgroup.json [15:44:48] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10Stevemunene) [15:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318955)', diff saved to https://phabricator.wikimedia.org/P38029 and previous config saved to /var/cache/conftool/dbconfig/20221103-154449-ladsgroup.json [15:45:32] (03CR) 10Jdlrobson: "This change is ready for review." [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852882 (owner: 10Jdlrobson) [15:46:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38030 and previous config saved to /var/cache/conftool/dbconfig/20221103-154631-ladsgroup.json [15:46:45] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:46:59] (03CR) 10Muehlenhoff: [C: 03+2] "Good catch wrt sre-admins! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/852907 (owner: 10Muehlenhoff) [15:47:30] PROBLEM - Disk space on conf1009 is CRITICAL: DISK CRITICAL - free space: / 2597 MB (3% inode=98%): /tmp 2597 MB (3% inode=98%): /var/tmp 2597 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1009&var-datasource=eqiad+prometheus/ops [15:47:40] (03PS3) 10Jdlrobson: Finish moving to Page Tools naming convention [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852882 [15:48:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P38031 and previous config saved to /var/cache/conftool/dbconfig/20221103-155101-ladsgroup.json [15:51:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P38032 and previous config saved to /var/cache/conftool/dbconfig/20221103-155136-marostegui.json [15:52:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Papaul) new IP address for the hosts is `` 10.64.16.14/22 `` switch also setup ` [edit interfaces ge-6/0/28] - description DISABLED; + des... 
[15:52:25] (03CR) 10Elukey: Update the spark and spark-operator images (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [15:52:39] (03PS2) 10Muehlenhoff: mail: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013) [15:52:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Papaul) ` papaul@asw2-b-eqiad> show interfaces ge-6/0/28 descriptions Interface Admin Link Description ge-6/0/28 up down dbproxy10... [15:53:18] (03CR) 10Elukey: [C: 03+1] wikilabels: Cleanup old DB proxy information [puppet] - 10https://gerrit.wikimedia.org/r/852779 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:53:54] (03PS2) 10Klausman: wikilabels: Cleanup old DB proxy information [puppet] - 10https://gerrit.wikimedia.org/r/852779 (https://phabricator.wikimedia.org/T307389) [15:54:01] (03PS1) 10Vgutierrez: Release 0.35 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/852917 (https://phabricator.wikimedia.org/T244232) [15:54:21] (03CR) 10Klausman: [C: 03+2] wikilabels: Cleanup old DB proxy information [puppet] - 10https://gerrit.wikimedia.org/r/852779 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:54:32] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for role::dns::auth [puppet] - 10https://gerrit.wikimedia.org/r/852918 [15:54:40] !log sudo -i reprepro -C main include bullseye-wikimedia varnish_6.0.10-1wm2_amd64.changes: T321309 [15:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:55:25] (03CR) 10Elukey: [C: 03+1] Move kube-scheduler config to file [puppet] - 10https://gerrit.wikimedia.org/r/852908 
(https://phabricator.wikimedia.org/T300499) (owner: 10JMeybohm) [15:55:42] (03CR) 10Muehlenhoff: [C: 03+2] mail: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:58:14] (03PS1) 10Samtar: [prod noop] InitialiseSettings-labs: wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852920 (https://phabricator.wikimedia.org/T322323) [15:58:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P38033 and previous config saved to /var/cache/conftool/dbconfig/20221103-155847-ladsgroup.json [15:58:53] (03CR) 10Dzahn: "thanks for merging this :)" [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [15:59:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:59:15] (03PS2) 10Samtar: [prod noop] InitialiseSettings-labs: wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852920 (https://phabricator.wikimedia.org/T322323) [15:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P38034 and previous config saved to /var/cache/conftool/dbconfig/20221103-155958-ladsgroup.json [16:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:20] (03CR) 10Vgutierrez: [C: 03+2] Release 0.35 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/852917 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [16:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P38035 and previous config saved to /var/cache/conftool/dbconfig/20221103-160141-ladsgroup.json [16:02:58] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1019.eqiad.wmnet with OS bullseye [16:03:05] (03PS1) 10Elukey: Import istioctl 1.15.3 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/852921 (https://phabricator.wikimedia.org/T322193) [16:03:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye [16:03:24] (03CR) 10Samtar: [C: 03+2] "beta cluster only, self-deploy, production noop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852920 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [16:03:50] (03PS1) 10Jbond: k8s::package: only install the apt source once [puppet] - 10https://gerrit.wikimedia.org/r/852922 [16:04:17] (03PS2) 10Elukey: Import istioctl 1.15.3 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/852921 (https://phabricator.wikimedia.org/T322193) [16:04:20] (03Merged) 10jenkins-bot: [prod noop] InitialiseSettings-labs: wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852920 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [16:05:22] 10SRE, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Epic, and 3 others: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10awight) [16:06:09] (03PS2) 10Jbond: k8s::package: only install the 
apt source once [puppet] - 10https://gerrit.wikimedia.org/r/852922 [16:06:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P38036 and previous config saved to /var/cache/conftool/dbconfig/20221103-160611-ladsgroup.json [16:06:18] (03PS1) 10KartikMistry: Set ContentTranslation MT threshold to 75 in Japanese WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852924 (https://phabricator.wikimedia.org/T321819) [16:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P38037 and previous config saved to /var/cache/conftool/dbconfig/20221103-160645-marostegui.json [16:08:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:09:53] (03CR) 10Dzahn: [C: 04-1] "this is going to be out-of-date after I570768bb03df32 but I will recycle this to switch hosts soon" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [16:10:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:10:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:13:43] (03CR) 10Dzahn: [C: 03+1] R:rsync::manifests::server::module: add type validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [16:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P38039 and previous config saved to /var/cache/conftool/dbconfig/20221103-161356-ladsgroup.json [16:15:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P38041 and previous config 
saved to /var/cache/conftool/dbconfig/20221103-161507-ladsgroup.json [16:15:45] (03PS3) 10Jbond: k8s::package: only install the apt source once [puppet] - 10https://gerrit.wikimedia.org/r/852922 [16:16:01] (03PS1) 10Samtar: [prod noop] InitialiseSettings-labs: Enable ContentTranslation on beta wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852946 (https://phabricator.wikimedia.org/T322325) [16:16:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P38042 and previous config saved to /var/cache/conftool/dbconfig/20221103-161648-ladsgroup.json [16:17:07] (03PS1) 10Vgutierrez: api: support sha256 checksums [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852947 [16:17:09] (03PS1) 10Vgutierrez: api: Offer JSON for metadata if requested [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852948 [16:17:11] (03PS1) 10Vgutierrez: readme: Add general notes for testing deps [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852949 [16:17:13] (03PS1) 10Vgutierrez: acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852950 (https://phabricator.wikimedia.org/T244232) [16:17:15] (03PS1) 10Vgutierrez: Release 0.35 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852951 (https://phabricator.wikimedia.org/T244232) [16:18:03] !log reprepro -C main include bullseye-wikimedia trafficserver_9.1.3-1wm3_amd64.changes: T321309 [16:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:06] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage [16:18:06] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [16:18:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 16): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37945/console" [puppet] - 10https://gerrit.wikimedia.org/r/852922 (owner: 10Jbond) [16:20:02] (03PS2) 10Samtar: [prod noop] InitialiseSettings-labs: Enable ContentTranslation on beta wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852946 (https://phabricator.wikimedia.org/T322325) [16:20:35] (03PS4) 10JMeybohm: k8s::package: only install the apt source once [puppet] - 10https://gerrit.wikimedia.org/r/852922 (https://phabricator.wikimedia.org/T270271) (owner: 10Jbond) [16:20:47] (03CR) 10CI reject: [V: 04-1] readme: Add general notes for testing deps [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852949 (owner: 10Vgutierrez) [16:20:49] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage [16:20:50] (03CR) 10CI reject: [V: 04-1] acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852950 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [16:21:05] (03CR) 10CI reject: [V: 04-1] Release 0.35 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852951 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [16:21:07] (03PS1) 10Hnowlan: thumbor: reduce memory limit, add service nodeport [deployment-charts] - 10https://gerrit.wikimedia.org/r/852953 (https://phabricator.wikimedia.org/T233196) [16:21:11] (03CR) 10CI reject: [V: 04-1] api: support sha256 checksums [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852947 (owner: 10Vgutierrez) [16:21:16] (03CR) 10CI reject: [V: 04-1] api: Offer JSON for metadata if requested [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852948 (owner: 10Vgutierrez) [16:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P38043 and previous config saved to /var/cache/conftool/dbconfig/20221103-162118-ladsgroup.json [16:21:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:21:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:21:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:21:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38044 and previous config saved to /var/cache/conftool/dbconfig/20221103-162141-ladsgroup.json [16:21:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321123)', diff saved to https://phabricator.wikimedia.org/P38045 and previous config saved to /var/cache/conftool/dbconfig/20221103-162152-marostegui.json [16:21:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [16:21:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:22:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [16:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T321123)', diff saved to https://phabricator.wikimedia.org/P38046 and previous config saved to /var/cache/conftool/dbconfig/20221103-162214-marostegui.json [16:22:45] (03PS2) 10Vgutierrez: api: support sha256 checksums [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852947 [16:22:47] (03PS2) 10Vgutierrez: api: Offer JSON for metadata if requested [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852948 [16:22:49] (03PS2) 10Vgutierrez: readme: Add 
general notes for testing deps [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852949 [16:22:51] (PS2) Vgutierrez: acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852950 (https://phabricator.wikimedia.org/T244232) [16:22:53] (PS2) Vgutierrez: Release 0.35 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852951 (https://phabricator.wikimedia.org/T244232) [16:22:55] (PS1) Vgutierrez: limit jinja2, itsdangerous and werkzeug version [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852954 [16:24:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321123)', diff saved to https://phabricator.wikimedia.org/P38047 and previous config saved to /var/cache/conftool/dbconfig/20221103-162437-marostegui.json [16:27:03] (CR) Vgutierrez: [C: +2] limit jinja2, itsdangerous and werkzeug version [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852954 (owner: Vgutierrez) [16:27:13] (CR) Jbond: [C: +2] admin: add Hibashaath to analytics-private [puppet] - https://gerrit.wikimedia.org/r/852914 (https://phabricator.wikimedia.org/T322146) (owner: Jbond) [16:28:07] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (jbond) Open→Resolved a:jbond @HShaath-WMF this access should be in place now; let me know if you encounter any issues [16:28:09] (CR) Vgutierrez: [C: +2] api: support sha256 checksums [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852947 (owner: Vgutierrez) [16:28:13] (CR) Vgutierrez: [C: +2] api: Offer JSON for metadata if requested [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852948 (owner: Vgutierrez) [16:28:16] (CR) Vgutierrez: [C: +2] readme: Add general notes for testing
deps [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852949 (owner: Vgutierrez) [16:28:20] (CR) Vgutierrez: [C: +2] acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852950 (https://phabricator.wikimedia.org/T244232) (owner: Vgutierrez) [16:28:23] (CR) Vgutierrez: [C: +2] Release 0.35 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/852951 (https://phabricator.wikimedia.org/T244232) (owner: Vgutierrez) [16:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38048 and previous config saved to /var/cache/conftool/dbconfig/20221103-162904-ladsgroup.json [16:29:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:29:09] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:29:12] (CR) Muehlenhoff: Enable profile::auto_restarts::service for jwt-authorizer on docker registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [16:29:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:29:24] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for jwt-authorizer on docker registry [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) [16:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T318605)', diff saved to https://phabricator.wikimedia.org/P38049 and previous config saved to /var/cache/conftool/dbconfig/20221103-162927-ladsgroup.json [16:29:34] (CR) Clément Goubert: [C: +1] thumbor: reduce memory limit, add service nodeport [deployment-charts] -
10https://gerrit.wikimedia.org/r/852953 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318955)', diff saved to https://phabricator.wikimedia.org/P38050 and previous config saved to /var/cache/conftool/dbconfig/20221103-163016-ladsgroup.json [16:30:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:30:19] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:30:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T318955)', diff saved to https://phabricator.wikimedia.org/P38051 and previous config saved to /var/cache/conftool/dbconfig/20221103-163041-ladsgroup.json [16:31:09] (03Merged) 10jenkins-bot: limit jinja2, itsdangerous and werkzeug version [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852954 (owner: 10Vgutierrez) [16:31:17] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10jbond) [16:31:21] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10jbond) @odimitrijevic are you able to approve this access please (both as the manager and access to analytics-private data-users) thanks [16:31:33] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10jbond) p:05Triage→03Medium [16:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38052 and previous config saved to 
/var/cache/conftool/dbconfig/20221103-163156-ladsgroup.json [16:31:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:32:01] PROBLEM - MediaWiki EtcdConfig up-to-date on parse1004 is CRITICAL: etcd last index (1236846) is outdated compared to the master one (1236849) https://wikitech.wikimedia.org/wiki/Etcd [16:32:01] PROBLEM - MediaWiki EtcdConfig up-to-date on parse1019 is CRITICAL: etcd last index (1236846) is outdated compared to the master one (1236849) https://wikitech.wikimedia.org/wiki/Etcd [16:32:01] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1441 is CRITICAL: etcd last index (1236846) is outdated compared to the master one (1236849) https://wikitech.wikimedia.org/wiki/Etcd [16:32:01] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1418 is CRITICAL: etcd last index (1236846) is outdated compared to the master one (1236849) https://wikitech.wikimedia.org/wiki/Etcd [16:32:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:32:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38053 and previous config saved to /var/cache/conftool/dbconfig/20221103-163219-ladsgroup.json [16:32:59] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:21] RECOVERY - MediaWiki EtcdConfig up-to-date on parse1004 is OK: etcd last index (1236852) matches the master one (1236852) https://wikitech.wikimedia.org/wiki/Etcd [16:33:21] RECOVERY - MediaWiki EtcdConfig up-to-date on parse1019 is OK: etcd last index (1236852) matches the master one (1236852) https://wikitech.wikimedia.org/wiki/Etcd [16:33:22] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1441 is OK: etcd last index (1236852) matches the master one 
(1236852) https://wikitech.wikimedia.org/wiki/Etcd [16:33:22] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1418 is OK: etcd last index (1236852) matches the master one (1236852) https://wikitech.wikimedia.org/wiki/Etcd [16:33:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318955)', diff saved to https://phabricator.wikimedia.org/P38054 and previous config saved to /var/cache/conftool/dbconfig/20221103-163324-ladsgroup.json [16:34:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38055 and previous config saved to /var/cache/conftool/dbconfig/20221103-163402-ladsgroup.json [16:34:59] (03Merged) 10jenkins-bot: api: support sha256 checksums [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852947 (owner: 10Vgutierrez) [16:35:01] (03Merged) 10jenkins-bot: api: Offer JSON for metadata if requested [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852948 (owner: 10Vgutierrez) [16:35:03] (03Merged) 10jenkins-bot: readme: Add general notes for testing deps [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852949 (owner: 10Vgutierrez) [16:35:05] (03CR) 10Hnowlan: [C: 03+2] thumbor: reduce memory limit, add service nodeport [deployment-charts] - 10https://gerrit.wikimedia.org/r/852953 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:35:48] (03Merged) 10jenkins-bot: acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852950 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [16:37:44] (03Merged) 10jenkins-bot: Release 0.35 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/852951 (https://phabricator.wikimedia.org/T244232) (owner: 10Vgutierrez) [16:38:15] (03PS1) 10Volans: sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - 
10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) [16:38:45] (03Merged) 10jenkins-bot: thumbor: reduce memory limit, add service nodeport [deployment-charts] - 10https://gerrit.wikimedia.org/r/852953 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:39:24] (03CR) 10Volans: sre.hosts.decommission: use mgmt IP if no DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [16:39:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P38056 and previous config saved to /var/cache/conftool/dbconfig/20221103-163947-marostegui.json [16:40:33] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:40:45] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:41:32] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:42:26] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1019.eqiad.wmnet with OS bullseye [16:42:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye completed:... 
[16:42:37] (CR) CI reject: [V: -1] sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [16:43:51] (CR) Samtar: [C: +2] "beta cluster only, self-deploy, production noop" [mediawiki-config] - https://gerrit.wikimedia.org/r/852946 (https://phabricator.wikimedia.org/T322325) (owner: Samtar) [16:45:23] (Merged) jenkins-bot: [prod noop] InitialiseSettings-labs: Enable ContentTranslation on beta wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/852946 (https://phabricator.wikimedia.org/T322325) (owner: Samtar) [16:45:56] (CR) Ssingh: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/852918 (owner: Muehlenhoff) [16:48:14] !log fnegri@cumin1001 conftool action : set/pooled=yes; selector: name=dbproxy1019.eqiad.wmnet,service=wikireplicas-b [16:48:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P38062 and previous config saved to /var/cache/conftool/dbconfig/20221103-164833-ladsgroup.json [16:48:44] (PS1) Hnowlan: Encode messages written to poolcounter stream [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) [16:48:53] (CR) BCornwall: [C: +2] varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [16:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P38063 and previous config saved to /var/cache/conftool/dbconfig/20221103-164909-ladsgroup.json [16:49:12] !log fnegri@cumin1001 conftool action : set/pooled=no; selector: name=dbproxy1018.eqiad.wmnet,service=wikireplicas-b [16:51:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START
helmfile.d/services/mw-debug: apply [16:52:09] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322347 (10isarantopoulos) [16:52:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:52:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:53:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:54:17] (03PS1) 10David Caro: wmcs: add socks proxy libraries [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [16:54:21] (03PS1) 10Dduvall: aptrepo: Add thirdparty/terraform [puppet] - 10https://gerrit.wikimedia.org/r/852961 (https://phabricator.wikimedia.org/T322344) [16:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P38065 and previous config saved to /var/cache/conftool/dbconfig/20221103-165456-marostegui.json [16:55:07] (03CR) 10LSobanski: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [16:55:17] (03CR) 10David Caro: "This just adds the libraries to setup the tunnel when/if needed." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [16:57:44] (03CR) 10CI reject: [V: 04-1] wmcs: add socks proxy libraries [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [17:00:04] bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1700). 
[17:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P38066 and previous config saved to /var/cache/conftool/dbconfig/20221103-170003-ladsgroup.json [17:00:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:01:38] Nothing for developer-portal this week. I did a Striker deploy on Monday so it is caught up as well. I've got some Toolhub stuff that could roll out today, but I think I will let it sit until next week. [17:03:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322347 (10isarantopoulos) [17:03:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P38067 and previous config saved to /var/cache/conftool/dbconfig/20221103-170341-ladsgroup.json [17:04:06] (03CR) 10David Caro: wmcs: add socks proxy libraries (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [17:04:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P38068 and previous config saved to /var/cache/conftool/dbconfig/20221103-170417-ladsgroup.json [17:05:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318605)', diff saved to https://phabricator.wikimedia.org/P38069 and previous config saved to /var/cache/conftool/dbconfig/20221103-170553-ladsgroup.json [17:05:57] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:06:04] PROBLEM - Disk space on conf1008 is CRITICAL: DISK CRITICAL - free space: / 2744 MB (3% inode=98%): /tmp 2744 MB (3% inode=98%): /var/tmp 2744 MB (3% inode=98%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1008&var-datasource=eqiad+prometheus/ops [17:08:31] (03CR) 10Vlad.shapik: [C: 04-1] Encode messages written to poolcounter stream (034 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:09:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10isarantopoulos) [17:10:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321123)', diff saved to https://phabricator.wikimedia.org/P38070 and previous config saved to /var/cache/conftool/dbconfig/20221103-171004-marostegui.json [17:10:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:10:09] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:10:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:10:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T321123)', diff saved to https://phabricator.wikimedia.org/P38071 and previous config saved to /var/cache/conftool/dbconfig/20221103-171028-marostegui.json [17:12:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321123)', diff saved to https://phabricator.wikimedia.org/P38072 and previous config saved to /var/cache/conftool/dbconfig/20221103-171250-marostegui.json [17:14:01] (03PS2) 10Vlad.shapik: requirements: add missing pycurl package [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/850453 (owner: 10Hnowlan) [17:15:13] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P38073 and previous config saved to /var/cache/conftool/dbconfig/20221103-171512-ladsgroup.json [17:17:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:17:50] (03PS1) 10Brennen Bearnes: scap targets: add phab1004.eqiad.wmnet [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/852965 (https://phabricator.wikimedia.org/T280597) [17:18:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318955)', diff saved to https://phabricator.wikimedia.org/P38074 and previous config saved to /var/cache/conftool/dbconfig/20221103-171850-ladsgroup.json [17:18:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:18:54] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:19:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:19:15] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318955)', diff saved to https://phabricator.wikimedia.org/P38075 and previous config saved to /var/cache/conftool/dbconfig/20221103-171925-ladsgroup.json [17:19:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
8:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:19:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [17:19:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:19:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:19:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:19:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [17:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T318955)', diff saved to https://phabricator.wikimedia.org/P38076 and previous config saved to /var/cache/conftool/dbconfig/20221103-171952-ladsgroup.json [17:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T318955)', diff saved to https://phabricator.wikimedia.org/P38077 and previous config saved to /var/cache/conftool/dbconfig/20221103-171959-ladsgroup.json [17:20:08] ack'd the NEL alert [17:20:29] the LVS alert resolved [17:21:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P38078 and previous config saved to /var/cache/conftool/dbconfig/20221103-172100-ladsgroup.json [17:22:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318955)', diff saved to https://phabricator.wikimedia.org/P38079 and previous config saved to /var/cache/conftool/dbconfig/20221103-172235-ladsgroup.json [17:23:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318955)', diff saved to https://phabricator.wikimedia.org/P38080 and previous config saved to 
/var/cache/conftool/dbconfig/20221103-172338-ladsgroup.json [17:25:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Hghani) Hi, I have signed the L3 document. [17:27:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P38081 and previous config saved to /var/cache/conftool/dbconfig/20221103-172759-marostegui.json [17:30:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P38082 and previous config saved to /var/cache/conftool/dbconfig/20221103-173022-ladsgroup.json [17:31:36] SRE, Analytics-Radar, Domains, Traffic-Icebox, and 3 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (BCornwall) Open→Resolved ` [~]$ curl -s -I https://en.wikipedia.org/ | grep Las...
[17:38:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P38083 and previous config saved to /var/cache/conftool/dbconfig/20221103-173843-ladsgroup.json [17:39:28] !log `sudo truncate -s 20G /var/log/nginx/etcd_access.log.1` on conf100[7-9], root partition full [17:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:56] bblack: ^^ [17:41:02] PROBLEM - Check systemd state on conf1007 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:10] PROBLEM - etcd service on conf1007 is CRITICAL: CRITICAL - Expecting active but unit etcd is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:41:14] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 12 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal [17:41:20] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:41:22] PROBLEM - Etcd cluster health on conf1007 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [17:41:46] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10ILooremeta-WMF) @jbond I have just signed the agreement. 
[17:43:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P38084 and previous config saved to /var/cache/conftool/dbconfig/20221103-174306-marostegui.json [17:43:12] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 2 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [17:43:42] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 37 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [17:43:44] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:34] RECOVERY - Disk space on conf1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1007&var-datasource=eqiad+prometheus/ops [17:45:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:46:50] !log fnegri@cumin1001 conftool action : set/pooled=inactive; selector: name=dbproxy1018.eqiad.wmnet,service=wikireplicas-b [17:46:54] !log vgutierrez@conf1007:~$ sudo -i systemctl start etcd [17:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:10] RECOVERY - Etcd cluster health on conf1007 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [17:47:46] RECOVERY - Disk space on conf1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1008&var-datasource=eqiad+prometheus/ops [17:48:48] RECOVERY - Check 
systemd state on conf1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:49] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:48:56] RECOVERY - etcd service on conf1007 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:49:42] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:49:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:51:18] RECOVERY - Disk space on conf1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1009&var-datasource=eqiad+prometheus/ops [17:52:44] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [17:53:07] (03PS1) 10Jbond: P:contact: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/852982 [17:54:50] (03PS2) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [17:54:52] (03PS1) 10Jbond: differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) [17:54:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:55:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: 
Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) I repooled the server using conftool: ` sudo confctl select "name=dbproxy1019.eqiad.wmnet,service=wikireplicas-b" set/pooled=yes fnegri... [17:59:12] PROBLEM - Check systemd state on snapshot1010 is CRITICAL: CRITICAL - degraded: The following units failed: fulldumps-rest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:29] (03PS3) 10Daniel Kinzler: Enable parsoid cache warming on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) [17:59:48] PROBLEM - Check systemd state on snapshot1011 is CRITICAL: CRITICAL - degraded: The following units failed: fulldumps-rest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1800). 
[18:00:21] (03PS2) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [18:01:09] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [18:01:28] PROBLEM - Check systemd state on snapshot1013 is CRITICAL: CRITICAL - degraded: The following units failed: fulldumps-rest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:28] PROBLEM - Check systemd state on snapshot1012 is CRITICAL: CRITICAL - degraded: The following units failed: fulldumps-rest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:16] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852986 (https://phabricator.wikimedia.org/T320513) [18:02:18] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852986 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:03:01] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852986 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:04:04] (03PS4) 10Daniel Kinzler: Enable parsoid cache warming on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) [18:07:18] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.8 refs T320513 [18:07:21] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [18:09:24] !log lvs1020: restart pybal to hopefully clear etcd error states [18:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:10:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:10:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:11:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:12:21] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: PCC: change the exist code when no hosts are found - https://phabricator.wikimedia.org/T270757 (10jbond) 05Open→03Resolved a:03jbond [18:12:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Marostegui) This is caused because of the host changing its IP. We need to update the grants on the clouddb* hosts for the new IP. I just applied... 
[18:13:06] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [18:13:22] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: Fix puppet-compiler redirects - https://phabricator.wikimedia.org/T264184 (10jbond) 05Open→03Resolved a:03jbond [18:13:55] (03PS5) 10Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) [18:14:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Papaul) 10.64.37.28/24 2620:0:861:119:10:64:37:28/64 [18:14:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:14:57] !log lvs1019: restart pybal to clear etcd error states [18:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:12] (03PS11) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) [18:15:29] (03PS12) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) [18:15:43] (03PS6) 10Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) [18:15:45] !log lvs1018: restart 
pybal to clear etcd error states [18:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:51] (03PS7) 10Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [18:15:57] (03PS2) 10Jbond: differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) [18:16:11] !log lvs1017: restart pybal to clear etcd error states [18:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:40] 10SRE, 10Dumps-Generation, 10serviceops: conf* host ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [18:16:55] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) (owner: 10Jbond) [18:17:04] 10SRE, 10Dumps-Generation, 10serviceops, 10Wikimedia-Incident: conf* host ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [18:17:19] (03CR) 10CI reject: [V: 04-1] differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) (owner: 10Jbond) [18:17:44] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:17:44] (03CR) 10CI reject: [V: 04-1] worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) (owner: 10Jbond) [18:17:44] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [18:17:44] 
RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 73 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal [18:17:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Marostegui) I have fixed the rest and they are all now up. @fnegri can I still have the old ip so I can clean up the leftovers? ` root@dbproxy10... [18:18:58] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler: puppet-facts-export sometimes fails with 'trusted' fact not found - https://phabricator.wikimedia.org/T289335 (10jbond) 05Open→03Resolved We now run the export regularly via systemd timers and i don't think we see this issue any more but please... [18:19:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:20:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:20:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:20:12] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [18:20:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:20:16] 10SRE, 10Dumps-Generation, 10serviceops, 10Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [18:20:20] (03PS8) 10Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [18:20:48] PROBLEM - Check systemd state on
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:16] 10SRE, 10Dumps-Generation, 10serviceops, 10Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [18:22:34] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:23:08] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 02m 56s) [18:24:00] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) [18:24:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:25:07] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 3 others: Add support to run pcc on cloud and production hosts - https://phabricator.wikimedia.org/T295062 (10jbond) 05In progress→03Resolved This is complete now, please reopen if something is missing [18:27:51] !log jynus@cumin1001 dbctl commit (dc=all): 'increase db1144:3315 load', diff saved to https://phabricator.wikimedia.org/P38086 and previous config saved to /var/cache/conftool/dbconfig/20221103-182750-jynus.json [18:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T318605)', diff saved to https://phabricator.wikimedia.org/P38087 and previous config saved to /var/cache/conftool/dbconfig/20221103-182756-ladsgroup.json [18:28:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:28:26] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [18:29:48] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [18:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:07] 10SRE, 10Dumps-Generation, 10serviceops, 10Patch-For-Review, 10Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [18:36:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for
Ilooremeta - https://phabricator.wikimedia.org/T322147 (10CMacholan) @jbond Approved on my end. Thanks! [18:36:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10CMacholan) @jbond Approved on my end. Thanks! [18:37:25] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [18:38:00] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: remove phab2001 from the list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:38:24] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: switch phab2001 to phab2002 in commented line [dns] - 10https://gerrit.wikimedia.org/r/852266 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:40:05] (03PS2) 10Volans: sre.hosts.decommission: use mgmt IP if no DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) [18:40:10] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Self-merging for trial deploy to phab1004, per discussion in most recent GitLab IC sync:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/852965 (https://phabricator.wikimedia.org/T280597) (owner: 10Brennen Bearnes) [18:40:15] (03PS1) 10Ladsgroup: WikiExporter: Avoid calling reload in processing every row [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852883 (https://phabricator.wikimedia.org/T298485) [18:41:29] jouncebot: nowandnext [18:41:29] For the next 1 hour(s) and 18 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T1800) [18:41:29] In 1 hour(s) and 18 minute(s): UTC late backport and config training 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T2000) [18:42:04] (03PS1) 10Filippo Giunchedi: dispatch: sync user role and info from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) [18:42:39] (03CR) 10CI reject: [V: 04-1] dispatch: sync user role and info from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [18:43:43] (03PS2) 10Filippo Giunchedi: dispatch: sync user role and info from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) [18:43:49] (03CR) 10Ladsgroup: [C: 03+2] WikiExporter: Avoid calling reload in processing every row [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852883 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:46:48] (03CR) 10Filippo Giunchedi: "You might wonder about deleting users, I've inquired upstream here: https://github.com/Netflix/dispatch/discussions/2652 (tl;dr not suppor" [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [18:47:20] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:48:18] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:49:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Marostegui see Papaul's comment above for the old IP :) [18:52:25] (03PS1) 10Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [18:53:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852883 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:54:35] (03PS1) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [labs/private] - 10https://gerrit.wikimedia.org/r/852994 [18:54:46] (03CR) 10CI reject: [V: 04-1] prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [18:55:05] (03PS2) 10Jbond: P:contact: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/852982 [18:55:07] (03PS1) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 [18:55:39] (03PS2) 10Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [18:56:32] (03CR) 10CI reject: [V: 04-1] used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 (owner: 10Jbond) [18:58:38] (03CR) 10jenkins-bot: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [18:59:17] (03Merged) 10jenkins-bot: WikiExporter: Avoid calling reload in processing every row [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852883 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:59:29] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] [18:59:35] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [18:59:35] T322360: conf* hosts 
ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 [18:59:48] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [18:59:51] (03PS2) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 [19:00:38] (03PS3) 10Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 [19:00:40] (03CR) 10CI reject: [V: 04-1] used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [puppet] - 10https://gerrit.wikimedia.org/r/852995 (owner: 10Jbond) [19:02:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:02:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318605)', diff saved to https://phabricator.wikimedia.org/P38088 and previous config saved to /var/cache/conftool/dbconfig/20221103-190258-ladsgroup.json [19:03:02] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:03:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:03:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:03:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:03:53] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] (duration: 04m 24s) [19:04:28] (03PS3) 10Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [19:04:44] PROBLEM - Host wcqs1001 is DOWN: PING CRITICAL - 
Packet loss = 100% [19:05:12] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [19:05:48] RECOVERY - Host wcqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:06:22] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 01m 10s) [19:06:30] (03CR) 10CI reject: [V: 04-1] prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [19:15:07] (03PS1) 10Andrew Bogott: codfw1dev: openstack version bumps [puppet] - 10https://gerrit.wikimedia.org/r/852997 (https://phabricator.wikimedia.org/T322359) [19:15:47] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: openstack version bumps [puppet] - 10https://gerrit.wikimedia.org/r/852997 (https://phabricator.wikimedia.org/T322359) (owner: 10Andrew Bogott) [19:16:29] (03CR) 10Dzahn: [C: 03+2] phabricator: switch phab2001 to phab2002 in commented line [dns] - 10https://gerrit.wikimedia.org/r/852266 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:16:32] (03PS2) 10Dzahn: phabricator: switch phab2001 to phab2002 in commented line [dns] - 10https://gerrit.wikimedia.org/r/852266 (https://phabricator.wikimedia.org/T322250) [19:18:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P38089 and previous config saved to /var/cache/conftool/dbconfig/20221103-191805-ladsgroup.json [19:19:45] (03PS1) 10Andrew Bogott: Add files and templates for Horizon/Zed [puppet] - 10https://gerrit.wikimedia.org/r/852998 (https://phabricator.wikimedia.org/T322359) [19:20:26] (03CR) 10CI reject: [V: 04-1] Add files and templates for Horizon/Zed [puppet] - 10https://gerrit.wikimedia.org/r/852998 (https://phabricator.wikimedia.org/T322359) (owner: 10Andrew Bogott) [19:22:06] (03PS4) 10Jbond: prepare: Allow specify a private repo change 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [19:22:09] !log brennen@deploy1002 Started deploy [phabricator/deployment@ea0ffa7]: initial deploy to phab1004 [19:22:21] (03PS2) 10Andrew Bogott: Add files and templates for Horizon/Zed [puppet] - 10https://gerrit.wikimedia.org/r/852998 (https://phabricator.wikimedia.org/T322359) [19:22:34] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ea0ffa7]: initial deploy to phab1004 (duration: 00m 25s) [19:23:13] (03PS5) 10Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [19:23:15] (03PS3) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [19:25:12] (03CR) 10CI reject: [V: 04-1] 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 (owner: 10Jbond) [19:25:42] (03PS3) 10Jbond: differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) [19:25:45] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [19:25:51] (03PS6) 10Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [19:25:57] (03PS4) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [19:26:18] (03CR) 10Dzahn: "noop on clouddumps1002 https://puppet-compiler.wmflabs.org/pcc-worker1002/37946/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/852256 (owner: 10Dzahn) [19:26:20] (03CR) 10Dzahn: [C: 03+2] dumps: datasets/fetcher, add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852256 (owner: 10Dzahn) [19:26:33] !log andrew@deploy1002 Finished deploy 
[horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 48s) [19:27:20] (03PS1) 10Andrew Bogott: codfw1dev horizon back to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/853000 (https://phabricator.wikimedia.org/T322359) [19:27:26] (03CR) 10CI reject: [V: 04-1] prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [19:27:28] (03CR) 10jenkins-bot: differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) (owner: 10Jbond) [19:27:37] (03PS5) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [19:28:49] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [19:29:18] (03CR) 10Andrew Bogott: [C: 03+2] Add files and templates for Horizon/Zed [puppet] - 10https://gerrit.wikimedia.org/r/852998 (https://phabricator.wikimedia.org/T322359) (owner: 10Andrew Bogott) [19:29:37] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 48s) [19:30:32] (03CR) 10Dzahn: dumps/distribution: move hardcoded host names to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:32:03] (03CR) 10Dzahn: "@nskaggs @hokwelum @ArielGlenn - There is no cloud version of cloud dumps, right? 
as in "changes in production hiera must be reflected in " [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:33:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P38090 and previous config saved to /var/cache/conftool/dbconfig/20221103-193311-ladsgroup.json [19:37:41] !log brennen@deploy1002 Started deploy [phabricator/deployment@ea0ffa7]: initial deploy to phab1004 [19:40:06] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ea0ffa7]: initial deploy to phab1004 (duration: 02m 24s) [19:43:20] (03PS1) 10Jdlrobson: Update lv and bn wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) [19:43:55] (03PS3) 10Dzahn: phabricator: remove phab2001 from the list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) [19:46:33] (03CR) 10Andrew Bogott: [C: 04-1] "I'm investigating more here... hoping this is actually not needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/853000 (https://phabricator.wikimedia.org/T322359) (owner: 10Andrew Bogott) [19:47:09] (03CR) 10Dzahn: [C: 03+2] phabricator: remove phab2001 from the list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:47:38] (03PS1) 10Ryan Kemper: Revert "Revert "query_service: Ensure prometheus exporter depends on blazegraph service"" [puppet] - 10https://gerrit.wikimedia.org/r/852885 [19:48:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T318605)', diff saved to https://phabricator.wikimedia.org/P38091 and previous config saved to /var/cache/conftool/dbconfig/20221103-194818-ladsgroup.json [19:48:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [19:48:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:48:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [19:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T318605)', diff saved to https://phabricator.wikimedia.org/P38092 and previous config saved to /var/cache/conftool/dbconfig/20221103-194839-ladsgroup.json [19:50:35] (03PS1) 10JHathaway: aux-k8s: initial values [deployment-charts] - 10https://gerrit.wikimedia.org/r/853004 (https://phabricator.wikimedia.org/T321120) [19:51:05] (03PS2) 10Ryan Kemper: query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - 10https://gerrit.wikimedia.org/r/852885 (https://phabricator.wikimedia.org/T322037) [19:51:18] (03CR) 10Dzahn: [C: 03+2] "firewall and rsync changes etc on phab hosts deployed, no issues" [puppet] - 10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:52:18] (03CR) 10Ryan 
Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37947/console" [puppet] - 10https://gerrit.wikimedia.org/r/852885 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [19:52:22] (03PS1) 10Samtar: [prod noop] InitialiseSettings-labs: Add `recommend.wmflabs` to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853005 (https://phabricator.wikimedia.org/T322323) [19:52:36] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/852885 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [19:52:48] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10elukey) Not sure if in scope with the task, but we should add monitoring to a metric like https://grafana.wiki... [19:53:30] (03CR) 10Dzahn: [C: 03+1] "yea, we use the discovery DNS name for this:" [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:53:59] (03CR) 10Bking: [C: 03+1] query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - 10https://gerrit.wikimedia.org/r/852885 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [19:54:04] (03CR) 10Dzahn: [C: 03+2] "the real fix is to create aphlict2001 VM to have it around so that the discovery name could be pointed to it if needed" [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:54:21] (03PS4) 10Dzahn: delete varnish service alias for phab2001-aphlict [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) [19:54:23] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - 10https://gerrit.wikimedia.org/r/852885 
(https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [19:56:07] !log FOO Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885; disabled puppet on query service fleet via `ryankemper@cumin1001:~$ sudo -E cumin 'A:wcqs-public or A:wdqs-all' 'sudo disable-puppet "T322037"'`; testing change on `wdqs1009` [19:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:11] T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037 [19:56:35] (03CR) 10Dzahn: "thanks bblack for doing research on this" [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [20:00:04] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221103T2000). [20:00:05] Jdlrobson, bwang, and duesen: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] * TheresNoTime can deploy [20:00:22] o/ [20:00:47] \o [20:00:53] I'll start with 852882 then Jdlrobson :) [20:01:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852882 (owner: 10Jdlrobson) [20:02:18] * duesen blinks [20:02:30] when did I get kicked out of this channel?... [20:02:41] (03CR) 10JHathaway: [C: 03+2] aux-k8s: initial values [deployment-charts] - 10https://gerrit.wikimedia.org/r/853004 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:02:46] >.> [20:04:42] TheresNoTime: i missed the start of the deployment... is anything going on right now? Can I get started? 
[20:05:02] duesen: ah yes, https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/852882 is currently merging [20:05:14] RECOVERY - Check systemd state on snapshot1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:15] ok [20:05:25] let me know when you need me :) [20:05:43] (03CR) 10Aftab: "If possible, wait for https://phabricator.wikimedia.org/T319223#8368048 to fix." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:05:49] TheresNoTime: could I do duesen 's patch for some backport training? :) [20:06:14] thcipriani: sure :) that vector patch has ~10 minutes left before it merges, could do it now? [20:06:26] with scap backport, it's soooo easy now :) [20:06:32] oh! great :) [20:06:36] RECOVERY - Check systemd state on snapshot1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:36] !log samtar@deploy1002 backport aborted: (duration: 05m 23s) [20:06:49] thcipriani: all yours, I cancelled that ^ [20:06:55] (03PS1) 10Ryan Kemper: query_service: make blazegraph exporter sleep before starting [puppet] - 10https://gerrit.wikimedia.org/r/853006 (https://phabricator.wikimedia.org/T322037) [20:06:56] RECOVERY - Check systemd state on snapshot1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:56] RECOVERY - Check systemd state on snapshot1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:12] (03CR) 10Bking: [C: 03+1] query_service: make blazegraph exporter sleep before starting [puppet] - 10https://gerrit.wikimedia.org/r/853006 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [20:08:41] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37948/console" [puppet] - 10https://gerrit.wikimedia.org/r/853006 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [20:09:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [20:09:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10Dzahn) a:03MarkTraceur [20:10:14] (03Merged) 10jenkins-bot: Enable parsoid cache warming on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [20:10:27] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:843955|Enable parsoid cache warming on testwiki. (T320535)]] [20:10:30] T320535: Put Parsoid output into the ParserCache on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320535 [20:10:46] !log thcipriani@deploy1002 thcipriani and daniel: Backport for [[gerrit:843955|Enable parsoid cache warming on testwiki. (T320535)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:11:12] duesen: your change is on mwdebug, check please [20:11:29] also, heartening to hear you say scap backport has made deployment easier <3 [20:12:40] (03CR) 10Jdlrobson: Update lv and bn wordmarks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:12:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10Dzahn) Hi @AnnWF Friendly ping that this access request is currently waiting for a step from you to sign L3. 
Everything else looks done with the NDA and approval. [20:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318605)', diff saved to https://phabricator.wikimedia.org/P38093 and previous config saved to /var/cache/conftool/dbconfig/20221103-201303-ladsgroup.json [20:13:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:13:35] thcipriani: hold on... [20:13:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Dzahn) [20:14:52] holding [20:14:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:14:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Dzahn) Welcome @isarantopoulos to WMF. Confirmed you have already signed L3 and looks like you prov... 
[20:15:41] (03PS1) 10Ryan Kemper: kibana: remove resource needed only for kibana 5 [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) [20:15:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:15:50] (03Merged) 10jenkins-bot: Finish moving to Page Tools naming convention [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852882 (owner: 10Jdlrobson) [20:15:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:16:05] thcipriani: i can still edit testwiki, so seems to work :) [20:16:17] oh good :) [20:16:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Dzahn) @jbond Confirmed employee via Namely, fwiw [20:16:28] (03PS2) 10Ryan Kemper: kibana: remove resource needed only for kibana 5 [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) [20:16:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Dzahn) 05Open→03In progress [20:16:31] alright, going live then [20:16:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:16:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) (owner: 10Ryan Kemper) [20:17:22] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10Dzahn) a:03odimitrijevic [20:18:10] duesen: FYI, there were some warnings on testwiki, may need some investigating: https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 [20:18:22] 10SRE, 
10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) [20:18:59] (03CR) 10Bking: [C: 03+1] kibana: remove resource needed only for kibana 5 [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) (owner: 10Ryan Kemper) [20:19:50] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37949/console" [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) (owner: 10Ryan Kemper) [20:20:03] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: make blazegraph exporter sleep before starting [puppet] - 10https://gerrit.wikimedia.org/r/853006 (https://phabricator.wikimedia.org/T322037) (owner: 10Ryan Kemper) [20:20:57] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:843955|Enable parsoid cache warming on testwiki. (T320535)]] (duration: 10m 30s) [20:21:00] T320535: Put Parsoid output into the ParserCache on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320535 [20:21:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) confirmed L3 signature @jbond fwiw, can't find on Namely though, unlike other users on current requests [20:21:10] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] kibana: remove resource needed only for kibana 5 [puppet] - 10https://gerrit.wikimedia.org/r/853007 (https://phabricator.wikimedia.org/T322358) (owner: 10Ryan Kemper) [20:21:13] ^ duesen should be live everywhere now [20:21:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:22:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) 
[20:22:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:22:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:23:12] thcipriani: thank you! [20:23:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:23:36] you're welcome :) [20:25:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) - confirmed L3 signature - confirmed in Namely (@Hghani I see you have a wikimedia -ctr email address, it's used here in Phabricator and I can see... [20:26:03] hm, IRCCloud maybe having issues. thcipriani: am I good to continue? [20:26:29] tnt-wmf: I think you can go ahead [20:26:31] tnt-wmf: yep, all clear from here, thank you tnt-wmf for allowing me to butt in [20:26:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) @jbond found in Namely but with unknown job title / manager. so the usual verification step might be missing, not sure. [20:26:51] no worries! 
:) Jdlrobson, resuming 852882 [20:27:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852882 (owner: 10Jdlrobson) [20:27:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) 05Open→03In progress [20:27:14] !log samtar@deploy1002 Started scap: Backport for [[gerrit:852882|Finish moving to Page Tools naming convention]] [20:27:34] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:852882|Finish moving to Page Tools naming convention]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:27:47] Jdlrobson: that's live on mwdebug now, can you test? [20:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P38094 and previous config saved to /var/cache/conftool/dbconfig/20221103-202810-ladsgroup.json [20:28:13] tnt-wmf: hi, can you please ping me when done with deployment? 🙂 [20:28:34] urbanecm: sure :) [20:28:38] thanks [20:28:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) 05Open→03In progress [20:28:52] tnt-wmf: looking [20:29:25] would you mind if i +2 my backports now, to save time on CI? (it's fine if you'd like me not to do that) [20:29:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10Dzahn) a:03calbon [20:29:34] tnt-wmf: perfect! please sync! 
[20:29:39] Jdlrobson: syncin' [20:29:55] urbanecm: sure, go ahead :) there's only a quick config patch after this vector one [20:30:02] sounds good! [20:30:13] (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Do not include explanatory text on transclusion [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852172 (https://phabricator.wikimedia.org/T321773) (owner: 10Urbanecm) [20:30:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10MarkTraceur) Approved. [20:30:33] (03PS2) 10Samtar: Update lv and bn wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:30:49] (03PS1) 10Urbanecm: ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853026 (https://phabricator.wikimedia.org/T321805) [20:30:50] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:04] (03CR) 10Urbanecm: [C: 03+2] ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853026 (https://phabricator.wikimedia.org/T321805) (owner: 10Urbanecm) [20:31:47] Jdlrobson: T322372, didn't show on mwdebug, looks related [20:31:47] T322372: ConfigException: GlobalVarConfig::get: undefined option: 'VectorArticleTools' - https://phabricator.wikimedia.org/T322372 [20:32:54] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2326 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:33:11] tnt-wmf: ^^, that alert seems not good [20:33:22] yeah, preparing to revert [20:33:49] though it has now stopped, not sure if it was "just" during sync [20:34:17] it's a spike only, but a big one [20:34:37] tnt-wmf: looking.. [20:34:48] I briefly got https://cdn.discordapp.com/attachments/1024122449035526205/1037826504660357200/unknown.png and so did someone else on Discord, but working now [20:34:52] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:35:00] yeah, seems it was a spike [20:35:01] VectorArticleTools should have been renamed to VectorPageTools [20:35:04] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:852882|Finish moving to Page Tools naming convention]] (duration: 07m 50s) [20:35:05] ack [20:35:17] that would be on the code path for every request [20:35:20] Possible lag? [20:35:23] NovemLinguae: thanks for the report, there was a spike of errors for some reason, but it seems to be over now. 
[20:35:29] !log T322037 Rolling changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885 and https://gerrit.wikimedia.org/r/853006 out to query service fleet, 4 hosts at a time: `ryankemper@cumin1001:~$ sudo -E cumin -b 4 'A:wcqs-public or A:wdqs-all' 'run-puppet-agent --force'` [20:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:31] T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037 [20:35:41] that would make a big spike if the config var was missing for a few seconds due to extension reg cache or something [20:35:48] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:50] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method= [20:35:53] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/852882/3/skin.json [20:36:00] ConfigException: GlobalVarConfig::get: undefined option: 'VectorArticleTools' [20:36:49] nothing since `:31`, are we happy to not revert? 
[20:36:50] looks like syncs are sometimes still order-dependent [20:36:59] yeh i think this is a sync problem [20:37:01] TheresNoTime: +1 to not reverting [20:37:03] urbanecm: tnt-wmf I see no reference to VectorArticleTools in code [20:37:10] yeah, I was just commenting on the cause [20:37:30] and thcipriani fyi ^^ (files got synced out-of-order and it caused an exception spike) [20:37:39] Okay, not going to revert. Jdlrobson, will be moving to 853003 now :) [20:37:47] thanks! [20:38:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:38:03] huh, well that's interesting. [20:38:15] T322372 fwiw [20:38:16] T322372: ConfigException: GlobalVarConfig::get: undefined option: 'VectorArticleTools' - https://phabricator.wikimedia.org/T322372 [20:38:20] indeed. i thought that should no longer happen. [20:38:21] I didn't think that was possible for appservers anymore [20:38:30] ^ dancy , FYI [20:38:46] (03PS1) 10JHathaway: aux-k8s: env config [deployment-charts] - 10https://gerrit.wikimedia.org/r/853009 (https://phabricator.wikimedia.org/T321120) [20:38:51] so this was code still looking for the old config setting? [20:38:59] it looks like it [20:39:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:12] (03Merged) 10jenkins-bot: Update lv and bn wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:39:26] !log samtar@deploy1002 Started scap: Backport for [[gerrit:853003|Update lv and bn wordmarks (T319223)]] [20:39:29] T319223: [XL] Deploy new set of logos for all Wikipedias except Gothic Wikipedia - https://phabricator.wikimedia.org/T319223 [20:39:34] hm... 
if the extension registry cache was to blame, it would be the other way around... code looking for the new config, but no default defined for it... so that's not it. [20:39:44] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:39:45] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:853003|Update lv and bn wordmarks (T319223)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:39:57] Jdlrobson: can you test that on mwdebug? [20:40:54] (and process question, Ctrl+C cancelling during the php restart stage to do a revert is reasonable, yes? In this case it was 90% done already so I'd wait regardless) [20:41:30] TheresNoTime: looking [20:41:39] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:41:58] (03CR) 10Aftab: Update lv and bn wordmarks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853003 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:41:58] TheresNoTime: LGTM [20:42:02] syncin' [20:42:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:52] (03CR) 10Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [20:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1185', diff saved to https://phabricator.wikimedia.org/P38096 and previous config saved to /var/cache/conftool/dbconfig/20221103-204316-ladsgroup.json [20:43:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:44:00] TheresNoTime: i think cancelling the sync at any time should be OK, if you're going to revert it later (only cancelling can leave the cluster in a confused state) [20:44:14] makes sense, thank you [20:44:40] np [20:44:46] (03PS2) 10Cwhite: hiera: all eqiad and codfw logging clusters to opensearch v2 [puppet] - 10https://gerrit.wikimedia.org/r/828111 (https://phabricator.wikimedia.org/T304440) [20:44:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:44:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:45:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:46:03] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:853003|Update lv and bn wordmarks (T319223)]] (duration: 06m 36s) [20:46:05] T319223: [XL] Deploy new set of logos for all Wikipedias except Gothic Wikipedia - https://phabricator.wikimedia.org/T319223 [20:46:27] Jdlrobson: they're live and the cache has been purged :) [20:46:29] Hmm.. I guess the problem in this case is that skin.json is not PHP code, so whatever code is reading that file will run for every relevant request, [20:46:31] urbanecm: all yours [20:46:34] (03CR) 10Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [20:46:41] thanks TheresNoTime! [20:46:45] * urbanecm is blocked on CI [20:46:59] oh, https://integration.wikimedia.org/zuul/ has a new interface! 
[20:47:24] (03CR) 10JHathaway: [C: 03+2] aux-k8s: env config [deployment-charts] - 10https://gerrit.wikimedia.org/r/853009 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:47:27] TheresNoTime: thanks for all your help today! [20:47:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853026 (https://phabricator.wikimedia.org/T321805) (owner: 10Urbanecm) [20:47:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852172 (https://phabricator.wikimedia.org/T321773) (owner: 10Urbanecm) [20:47:42] Jdlrobson: you're welcome! :) [20:47:47] and yes urbanecm, I quite like it! [20:47:53] * urbanecm too [20:47:54] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:48:16] dancy: yeah, whatever is turning the variables in the json into global(ish) php variables and where that happens relative to where the code that reads that variable is being called at execution time. [20:49:15] Nod. The change to skin.json is immediately live while the change to includes/Constants.php (which updates the reference to the entry in skin.json) is still running the old code until restart. 
[20:49:29] (03Merged) 10jenkins-bot: SpecialManageMentors: Do not include explanatory text on transclusion [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/852172 (https://phabricator.wikimedia.org/T321773) (owner: 10Urbanecm) [20:49:32] (03Merged) 10jenkins-bot: ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853026 (https://phabricator.wikimedia.org/T321805) (owner: 10Urbanecm) [20:49:32] old code and there using the old (removed) reference [20:49:53] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853026|ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config (T321805)]], [[gerrit:852172|SpecialManageMentors: Do not include explanatory text on transclusion (T321773)]] [20:49:53] oh! [20:49:57] T321805: Mentees cannot opt out from mentorship anymore - https://phabricator.wikimedia.org/T321805 [20:49:58] T321773: Way to hide preview text when placing Special:ManageMentors on a page - https://phabricator.wikimedia.org/T321773 [20:49:58] that makes sense [20:50:00] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [20:50:01] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [20:50:14] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:853026|ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config (T321805)]], [[gerrit:852172|SpecialManageMentors: Do not include explanatory text on transclusion (T321773)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:50:18] is that true? The json gets re-read on requests? [20:50:28] duesen: do you know? 
[20:50:47] (03PS2) 10Samtar: [prod noop] InitialiseSettings-labs: Add `recommend.wmflabs` to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853005 (https://phabricator.wikimedia.org/T322323) [20:51:16] I guess it could be another mtime cache that exists...somewhere [20:51:16] urbanecm: would you mind +2ing ^ (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/853005) when you're done? It's beta-only [20:52:41] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:52:46] (03CR) 10Urbanecm: [C: 03+2] [prod noop] InitialiseSettings-labs: Add `recommend.wmflabs` to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853005 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [20:52:48] not at all [20:52:54] ta [20:52:58] np [20:53:32] (03Merged) 10jenkins-bot: [prod noop] InitialiseSettings-labs: Add `recommend.wmflabs` to CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853005 (https://phabricator.wikimedia.org/T322323) (owner: 10Samtar) [20:55:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:55:38] (03PS1) 10Dzahn: remove phab1001-aphlict.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) [20:55:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:55:54] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:55:58] (03PS3) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [20:56:23] (03PS2) 10Dzahn: remove phab1001-aphlict.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) [20:56:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853026|ApiSetMenteeStatus: Check GEMentorshipEnabled in wiki config (T321805)]], [[gerrit:852172|SpecialManageMentors: Do not include explanatory text on transclusion (T321773)]] (duration: 06m 34s) [20:56:31] T321805: Mentees cannot opt out from mentorship anymore - https://phabricator.wikimedia.org/T321805 [20:56:31] T321773: Way to hide preview text when placing Special:ManageMentors on a page - https://phabricator.wikimedia.org/T321773 [20:56:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:56:40] * urbanecm done [20:56:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:56:59] (03CR) 10CI reject: [V: 04-1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) 
[20:57:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 438 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:57:04] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) >>! In T322360#8368215, @elukey wrote: > Not sure if in scope with the task, but we should add monito... [20:57:18] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [20:57:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:57:36] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T318605)', diff saved to https://phabricator.wikimedia.org/P38097 and previous config saved to /var/cache/conftool/dbconfig/20221103-205823-ladsgroup.json [20:58:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [20:58:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:58:45] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:58:47] urbanecm: 
https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2022.11.03?id=tDdHP4QBW_7Siu4Bq4xT related to your patch? [20:58:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [20:58:56] looking [20:58:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T318605)', diff saved to https://phabricator.wikimedia.org/P38098 and previous config saved to /var/cache/conftool/dbconfig/20221103-205855-ladsgroup.json [20:59:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:10] (03PS4) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [21:00:00] (03CR) 10CI reject: [V: 04-1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [21:00:18] (03PS3) 10Dzahn: remove phab1001-aphlict.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) [21:00:20] TheresNoTime: yes, and it's the same issue, but with extensions.json. 
[21:00:24] *extension.json [21:00:41] ah, seems to have stopped yeah [21:00:45] aka extension.json passes an additional parameter to the API's constructor immediately [21:01:01] but ApiSetMenteeStatus only recognizes that after php-fpm-restart finishes its job [21:01:37] i just double-checked the relevant UI, and no bugs this time :) [21:01:47] ^^ [21:01:48] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) Adding @daniel as I believe this was the problematic patch, 5b0b54599bfd, but I am not 100% sure, bec... [21:02:20] (03PS1) 10JHathaway: aux-k8s: add aux-k8s to kubernetes_cluster_groups [puppet] - 10https://gerrit.wikimedia.org/r/853011 (https://phabricator.wikimedia.org/T321137) [21:02:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:03:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:03:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:04:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:05:40] (03CR) 10JHathaway: [C: 03+2] aux-k8s: add aux-k8s to kubernetes_cluster_groups [puppet] - 10https://gerrit.wikimedia.org/r/853011 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [21:07:44] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:08:17] !log [WCQS] Pooled `wcqs100[1,2]` [21:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:26] (03Abandoned) 10Urbanecm: SpecialManageMentors: Do not include explanatory text on transclusion [extensions/GrowthExperiments] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/852173 (https://phabricator.wikimedia.org/T321773) (owner: 
Urbanecm) [21:09:30] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:09:30] SRE, Dumps-Generation, serviceops, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (ArielGlenn) There was a config setting that turned it on for November. See https://gerrit.wikimedia.org/r/c/op... [21:12:19] (PS2) Jdlrobson: Remove logo setting in YAML files [mediawiki-config] - https://gerrit.wikimedia.org/r/843514 [21:12:39] (CR) Jdlrobson: "Does removing this make sense or is this part of a grander plan?" [mediawiki-config] - https://gerrit.wikimedia.org/r/843514 (owner: Jdlrobson) [21:19:15] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:31] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:19:32] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:19:38] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:19:39] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:21:07] (CR) ArielGlenn: dumps/distribution: move hardcoded host names to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:22:04] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:22:06] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:22:28] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[21:22:30] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:24:20] (CR) Dzahn: dumps/distribution: move hardcoded host names to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:24:21] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [21:26:52] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 02m 31s) [21:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318605)', diff saved to https://phabricator.wikimedia.org/P38099 and previous config saved to /var/cache/conftool/dbconfig/20221103-212810-ladsgroup.json [21:28:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:28:57] SRE, Dumps-Generation, serviceops, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo) Thank you, Ariel! [21:29:48] (CR) Cwhite: [C: +2] hiera: all eqiad and codfw logging clusters to opensearch v2 [puppet] - https://gerrit.wikimedia.org/r/828111 (https://phabricator.wikimedia.org/T304440) (owner: Cwhite) [21:29:50] (PS1) Hashar: Import gerrit-theme.js history from Puppet [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853025 (https://phabricator.wikimedia.org/T319378) [21:34:16] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:31] (CR) Brennen Bearnes: [C: +1] "I don't see anything obvious pointing to it, at least." 
[dns] - https://gerrit.wikimedia.org/r/853010 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:36:28] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:38:43] (CR) ArielGlenn: dumps/distribution: move hardcoded host names to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:39:35] (CR) Dzahn: dumps/distribution: move hardcoded host names to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:40:58] (PS1) Dzahn: site/phabricator: move phab2001 from prod to insetup role [puppet] - https://gerrit.wikimedia.org/r/853051 (https://phabricator.wikimedia.org/T322250) [21:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P38100 and previous config saved to /var/cache/conftool/dbconfig/20221103-214317-ladsgroup.json [21:44:19] (PS1) Hashar: Import gerrit-theme.js history from Puppet [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853052 (https://phabricator.wikimedia.org/T319378) [21:45:20] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [21:45:50] (CR) Dzahn: dumps/distribution: move hardcoded host names to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852259 
(https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:46:10] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 50s) [21:47:32] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [21:47:38] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 05s) [21:47:52] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [21:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:48:42] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 50s) [21:50:18] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:51:35] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [21:51:47] (PS1) Hashar: Move test result table to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853056 (https://phabricator.wikimedia.org/T319378) [21:51:52] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 16s) [21:51:53] (PS1) Hashar: Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) [21:51:59] (PS1) Hashar: Move custom links to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 
https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) [21:52:08] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P38101 and previous config saved to /var/cache/conftool/dbconfig/20221103-215823-ladsgroup.json [22:06:27] (CR) Jforrester: Remove logo setting in YAML files (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/843514 (owner: Jdlrobson) [22:11:02] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [22:11:10] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 08s) [22:11:44] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [22:13:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T318605)', diff saved to https://phabricator.wikimedia.org/P38102 and previous config saved to /var/cache/conftool/dbconfig/20221103-221329-ladsgroup.json [22:13:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:13:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:13:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:13:52] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 02m 07s) [22:30:13] (PS1) Hashar: gerrit: remove gerrit-theme.js [puppet] - https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) [22:35:44] (CR) Jdlrobson: Remove logo setting in YAML files (1 comment) 
[mediawiki-config] - https://gerrit.wikimedia.org/r/843514 (owner: Jdlrobson) [22:40:50] !log logstash eqiad - opensearch 2.2.0 upgrade complete T304440 [22:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:54] T304440: Test and upgrade OpenSearch to 2.2.0 - https://phabricator.wikimedia.org/T304440 [22:42:52] !log krinkle@deploy1002 Started deploy [integration/docroot@44f1640]: (no justification provided) [22:43:21] !log krinkle@deploy1002 Finished deploy [integration/docroot@44f1640]: (no justification provided) (duration: 00m 29s) [22:44:47] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [22:45:47] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 01m 00s) [22:51:29] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (greg) @BBlack @KOfori Hi both! Could we bother you for another look at this task? A long time has passed since it originally was filed. The...