[00:04:02] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P40699 and previous config saved to /var/cache/conftool/dbconfig/20221123-001147-marostegui.json [00:12:54] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:14:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1004.eqiad.wmnet with OS bullseye [00:14:15] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye completed: - dbprov1004 (**WARN**)... 
[00:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:24:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40700 and previous config saved to /var/cache/conftool/dbconfig/20221123-002654-marostegui.json [00:26:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance [00:27:01] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:27:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance [00:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321130)', diff saved to https://phabricator.wikimedia.org/P40701 and previous config saved to /var/cache/conftool/dbconfig/20221123-002716-marostegui.json [00:39:07] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P40702 and previous config saved to /var/cache/conftool/dbconfig/20221123-004005-marostegui.json [00:40:11] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:45:37] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P40703 and previous config saved to /var/cache/conftool/dbconfig/20221123-005511-marostegui.json [00:59:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [00:59:46] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2041.codfw.wmnet with OS bullseye [00:59:47] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye [00:59:55] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 
for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu... [01:00:55] !log sudo rm /etc/dhcp/automation/ttyS1-115200/cp2041.conf [01:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [01:01:22] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye [01:04:57] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:10:15] (PS3) Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) [01:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P40704 and previous config saved to /var/cache/conftool/dbconfig/20221123-011018-marostegui.json [01:10:25] (CR) Krinkle: [C: +2] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle) [01:11:06] (Merged) jenkins-bot: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle) [01:11:31] !log sukhe@cumin2002 END (ERROR) 
- Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye [01:11:38] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu... [01:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:15:57] PROBLEM - Host cp2042 is DOWN: PING CRITICAL - Packet loss = 100% [01:16:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [01:20:55] RECOVERY - Host cp2042 is UP: PING WARNING - Packet loss = 75%, RTA = 33.20 ms [01:25:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321130)', diff saved to https://phabricator.wikimedia.org/P40705 and previous config saved to /var/cache/conftool/dbconfig/20221123-012524-marostegui.json [01:25:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [01:25:31] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:25:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [01:29:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye [01:29:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [01:36:06] (PS2) Krinkle: admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog) [01:36:07] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance [01:36:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance [01:36:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40706 and previous config saved to /var/cache/conftool/dbconfig/20221123-013627-marostegui.json [01:36:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:59] (CR) Ssingh: [C: +2] admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog) [01:43:04] (CR) Ssingh: [V: +2 C: +2] admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog) [01:43:31] (PS3) Ssingh: admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog) [01:43:52] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye [01:49:12] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40707 and previous config saved to /var/cache/conftool/dbconfig/20221123-014912-marostegui.json [01:49:18] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [01:56:25] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:00:31] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to 
https://phabricator.wikimedia.org/P40708 and previous config saved to /var/cache/conftool/dbconfig/20221123-020418-marostegui.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:14:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [02:15:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [02:15:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [02:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P40709 and previous config saved to /var/cache/conftool/dbconfig/20221123-021925-marostegui.json [02:19:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [02:21:33] PROBLEM - Ganeti memory on ganeti1023 
is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:50] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul) [02:27:04] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (andrea.denisse) [02:27:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2041'] [02:27:18] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul) Open→Resolved @jcrespo this is done [02:27:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [02:30:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:30:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [02:31:23] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:34:32] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40710 and previous config saved to /var/cache/conftool/dbconfig/20221123-023431-marostegui.json [02:34:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance [02:34:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [02:34:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance [02:34:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40711 and previous config saved to /var/cache/conftool/dbconfig/20221123-023453-marostegui.json [02:42:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2041.codfw.wmnet with OS bullseye [02:47:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40712 and previous config saved to /var/cache/conftool/dbconfig/20221123-024751-marostegui.json [02:47:57] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [02:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:57:51] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P40713 
and previous config saved to /var/cache/conftool/dbconfig/20221123-030257-marostegui.json [03:09:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:15:13] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:18:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P40714 and previous config saved to /var/cache/conftool/dbconfig/20221123-031804-marostegui.json [03:19:13] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:24:09] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:24:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:29:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40715 and previous config saved to /var/cache/conftool/dbconfig/20221123-033310-marostegui.json [03:33:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance [03:33:17] T321130: Add column cuc_private to cu_changes on wmf 
wikis - https://phabricator.wikimedia.org/T321130 [03:33:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance [03:33:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40716 and previous config saved to /var/cache/conftool/dbconfig/20221123-033332-marostegui.json [03:45:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40717 and previous config saved to /var/cache/conftool/dbconfig/20221123-034554-marostegui.json [03:46:00] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:51:41] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [04:01:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P40718 and previous config saved to /var/cache/conftool/dbconfig/20221123-040100-marostegui.json [04:07:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [04:09:03] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [04:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P40719 and previous config saved to /var/cache/conftool/dbconfig/20221123-041607-marostegui.json [04:19:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:21:59] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [04:24:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:28:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:29:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:30:19] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40720 and previous config saved to /var/cache/conftool/dbconfig/20221123-043114-marostegui.json [04:31:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [04:31:20] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [04:31:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [04:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40721 and previous config saved to /var/cache/conftool/dbconfig/20221123-043135-marostegui.json [04:39:59] 
(KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:44:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40722 and previous config saved to /var/cache/conftool/dbconfig/20221123-044523-marostegui.json [04:45:29] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [04:48:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:52:19] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [04:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:00:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P40723 and previous config saved to /var/cache/conftool/dbconfig/20221123-050029-marostegui.json [05:08:33] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:15:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P40724 and previous config saved to /var/cache/conftool/dbconfig/20221123-051536-marostegui.json [05:18:43] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:19:01] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:22:47] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:29:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40725 and previous config saved to /var/cache/conftool/dbconfig/20221123-053043-marostegui.json [05:30:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [05:30:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [05:30:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [05:31:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40726 and 
previous config saved to /var/cache/conftool/dbconfig/20221123-053104-marostegui.json [05:34:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:38:47] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:39:03] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40727 and previous config saved to /var/cache/conftool/dbconfig/20221123-054345-marostegui.json [05:43:52] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [05:44:13] RECOVERY - cassandra-a CQL 10.64.32.22:9042 on aqs1018 is OK: TCP OK - 0.000 second response time on 10.64.32.22 port 9042 https://phabricator.wikimedia.org/T93886 [05:44:57] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:53:19] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [05:57:19] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5309726728 and 57442 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:57:43] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 640697488 and 287 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:57:43] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11642464760 and 57468 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:58:36] (CR) Giuseppe Lavagetto: [C: +2] Add conversion for ingress [deployment-charts] - https://gerrit.wikimedia.org/r/859567 (owner: Giuseppe Lavagetto) [05:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P40728 and previous config saved to /var/cache/conftool/dbconfig/20221123-055852-marostegui.json [05:59:21] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 540688 and 95 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:59:47] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:00:08] (PS5) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016 [06:01:47] RECOVERY - Postgres 
Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 507952 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:02:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [06:02:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [06:02:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321126)', diff saved to https://phabricator.wikimedia.org/P40729 and previous config saved to /var/cache/conftool/dbconfig/20221123-060228-marostegui.json [06:02:34] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [06:04:41] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:05:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321126)', diff saved to https://phabricator.wikimedia.org/P40730 and previous config saved to /var/cache/conftool/dbconfig/20221123-060500-marostegui.json [06:07:48] (03PS6) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [06:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P40731 and previous config saved to /var/cache/conftool/dbconfig/20221123-060956-root.json [06:10:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:10:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:11:56] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:12:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P40732 and previous config saved to /var/cache/conftool/dbconfig/20221123-061226-root.json [06:13:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) @wiki_willy @Jclark-ctr even if the task is stalled, just to make sure: these servers are still in rotation, Please do not decommission them until we've removed them. We need to resolve... [06:13:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P40733 and previous config saved to /var/cache/conftool/dbconfig/20221123-061358-marostegui.json [06:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:27:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P40734 and previous config saved to /var/cache/conftool/dbconfig/20221123-062731-root.json [06:29:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40735 and previous config saved to /var/cache/conftool/dbconfig/20221123-062905-marostegui.json [06:29:07] 
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance [06:29:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance [06:29:11] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:39:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:39:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:39:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40736 and previous config saved to /var/cache/conftool/dbconfig/20221123-063932-marostegui.json [06:39:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:42:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P40737 and previous config saved to /var/cache/conftool/dbconfig/20221123-064236-root.json [06:51:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40738 and previous config saved to /var/cache/conftool/dbconfig/20221123-065153-marostegui.json [06:51:59] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:57:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P40739 and previous config saved to /var/cache/conftool/dbconfig/20221123-065741-root.json [06:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:04:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:05:41] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P40740 and previous config saved to /var/cache/conftool/dbconfig/20221123-070659-marostegui.json [07:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:12:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P40741 and previous config saved to /var/cache/conftool/dbconfig/20221123-071246-root.json [07:20:03] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 28307706224 and 1439 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:20:15] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45376667648 and 1451 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:20:45] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49965269256 and 1481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:20:45] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25276122560 and 1481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:20:45] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31310753096 and 1482 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:22:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P40742 and previous config saved to /var/cache/conftool/dbconfig/20221123-072208-marostegui.json [07:37:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40743 and previous config saved to /var/cache/conftool/dbconfig/20221123-073714-marostegui.json [07:37:21] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:37:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance [07:37:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance [07:40:19] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:50] (03CR) 
ArielGlenn: "Still a pcc failure https://puppet-compiler.wmflabs.org/output/852260/38398/clouddumps1001.wikimedia.org/change.clouddumps1001.wikimedia.o" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [07:44:12] SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) I have left mysql stopped so @Papaul can do the test whenever he wants. [07:48:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:48:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:57:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T0800) [08:00:05] kart_ and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:10] (PS2) KartikMistry: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) [08:00:18] * kart_ is here [08:00:22] <_joe_> o/ [08:00:23] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1027.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:00:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1027.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:01:05] _joe_: go ahead with your patch, while I just rebased my config patch.. still on CI [08:01:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:42] (PS1) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) [08:01:44] (PS1) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) [08:01:46] (PS1) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) [08:01:48] (PS1) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) [08:01:59] <_joe_> kart_: tbh, I was waiting for a deployer to be around [08:02:19] <_joe_> I needed a +1 on the patch at least [08:02:21] (CR) CI reject: [V: -1] site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: Giuseppe Lavagetto) [08:02:25] <_joe_> so I will just wait [08:02:29] OK. 
I'll go with my patch first and see if anyone around. [08:03:15] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) (owner: KartikMistry) [08:03:58] (CR) CI reject: [V: -1] site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: Giuseppe Lavagetto) [08:04:14] (Merged) jenkins-bot: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) (owner: KartikMistry) [08:04:32] !log kartik@deploy1002 Started scap: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]] [08:04:32] (PS2) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) [08:04:34] (PS2) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) [08:04:36] (PS2) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) [08:04:38] T323415: Make Western Frisian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T323415 [08:04:38] (PS2) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) [08:04:55] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:05:15] (CR) CI reject: [V: -1] 
site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: Giuseppe Lavagetto) [08:06:41] (CR) CI reject: [V: -1] site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: Giuseppe Lavagetto) [08:07:21] <_joe_> urbanecm, Amir1: around? [08:09:55] (PS3) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) [08:09:57] (PS3) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) [08:09:59] (PS3) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) [08:10:01] (PS3) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) [08:12:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS bullseye [08:12:11] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye [08:13:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:14:29] (CR) JMeybohm: "Nice! Do we want this right now? In that case we will have to do a backport release of the 0.1.x version of this chart." 
[deployment-charts] - https://gerrit.wikimedia.org/r/859586 (owner: Alexandros Kosiaris) [08:14:32] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]] (duration: 10m 00s) [08:14:38] T323415: Make Western Frisian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T323415 [08:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:16:36] (CR) JMeybohm: [C: +2] pontoon: Add .crt filename suffix to PKI root CA [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: JMeybohm) [08:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:24:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:25:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage [08:26:06] (PS1) Marostegui: db1133: Move it to test-s4 section [puppet] - https://gerrit.wikimedia.org/r/859972 (https://phabricator.wikimedia.org/T322993) [08:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage [08:28:33] SRE, Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (SLyngshede-WMF) We'll attempt to build using RQ and the Django RQ module. 
RQ supports basic job queuing, as well as job dependencies and the ability to get job status. Other than supp... [08:30:29] 10SRE, 10Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (10SLyngshede-WMF) Basic proof-of-concept for queuing have been done of simple queues. Remaining is the job dependency and job status. These are supported by RQ directly, but it's unclea... [08:34:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:39:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:42:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1027.eqiad.wmnet with OS bullseye [08:42:11] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye completed: - ganeti1027 (**PASS**) - Downtimed on... 
[08:53:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:57:03] (03PS5) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) [08:59:07] (03PS2) 10Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013) [09:04:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:06:01] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:06:16] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:06:32] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:06:40] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:34] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:44] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:11:58] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:12:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:14:43] !log restart kube-apiserver on ml-serve-ctrl1001 as attempt to mitigate weird LIST latencies [09:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:15:32] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:16:32] !log set thanos ring replicas to 3.10 T311690 [09:16:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:38] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [09:18:10] (03PS4) 10Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) [09:19:29] !log restart kube-apiserver on ml-staging-ctrl2001 as attempt to mitigate weird LIST latencies [09:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:49] cc: klausman: --^ (both ml-serve-eqiad and staging :) [09:21:06] darn [09:21:08] (03PS3) 
Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) [09:21:51] (CR) Vgutierrez: [C: +1] Transferer: Enable PBKDF2 usage with 310000 iterations (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: Jcrespo) [09:23:13] (CR) David Caro: "Just one comment, looks good" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: Raymond Ndibe) [09:23:51] (CR) Elukey: [C: +2] ml-services: Update docker images to use single model server [deployment-charts] - https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: Ilias Sarantopoulos) [09:24:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:24:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST validatingwebhookconfigurations) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:25:10] (PS1) Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) [09:25:14] (PS1) Muehlenhoff: Add udevd to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991) [09:28:35] (CR) Elukey: [C: +2] team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: Elukey) 
[09:29:40] (CR) Stevemunene: [C: +2] Allow introspection for production environment [puppet] - https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: Stevemunene) [09:32:48] (CR) MVernon: swift: move ms-be2050 to new naming schema (2 comments) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [09:33:27] !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided) [09:33:43] !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 00m 15s) [09:35:05] (CR) Kosta Harlan: [C: -2] GrowthExperiments: Allow accessing NewImpact module in production (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: Kosta Harlan) [09:36:52] (PS1) Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) [09:37:51] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32 and 7337 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:38:07] (PS1) Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) [09:38:41] (CR) Jbond: [C: -1] swift: move ms-be2050 to new naming schema (2 comments) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [09:41:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:42:19] (PS1) Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) [09:42:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:42:34] !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided) [09:42:39] !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 00m 05s) [09:43:40] (CR) Jbond: [C: +1] Add udevd to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:45:22] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for virtlogd [puppet] - https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) [09:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:49:10] (PS2) Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) [09:49:12] (PS2) Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) [09:49:14] (PS2) Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) 
[09:50:19] (CR) Raymond Ndibe: "This change is ready for review." (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe) [09:51:33] (PS3) Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) [09:52:23] PROBLEM - swift eqiad object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [09:55:18] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38400/console" [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert) [09:55:20] (CR) MVernon: swift: move ms-be2050 to new naming schema (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [09:57:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:59:10] (CR) Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe) [10:00:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet [10:01:30] (PS1) Arturo Borrero Gonzalez: cloudvirt1047: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184) [10:02:28] (KubernetesAPILatency) 
firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:03:35] (03PS1) 10Vgutierrez: archiva: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) [10:05:56] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "final sync before merging 804575 - jbond@cumin2002" [10:06:15] (03PS1) 10Vgutierrez: gerrit: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720) [10:07:17] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1998912 and 1650 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:08:15] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "final sync before merging 804575 - jbond@cumin2002" [10:08:25] (03PS6) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [10:08:32] (03CR) 10Jbond: [C: 03+2] sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:09:00] (03CR) 10Jbond: [C: 03+2] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [10:10:32] (03CR) 10Muehlenhoff: [C: 03+2] Add udevd to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:10:35] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) [10:10:45] (03CR) 10Alexandros 
Kosiaris: felix: Instruct felix to set the src parameter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586 (owner: 10Alexandros Kosiaris) [10:10:53] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:11:13] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [10:11:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet [10:11:59] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:04] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38401/console" [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [10:13:05] (03Merged) 10jenkins-bot: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:13:16] (03CR) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [10:13:23] (03CR) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [10:14:29] (03PS4) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) 
[10:14:53] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [10:14:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [10:15:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:15:47] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:11] (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond) [10:16:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:17] (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [10:16:23] (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [10:16:29] (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [10:16:40] httpbb_hourly_appserver.service < handled [10:18:05] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 393280 and 2299 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:18:09] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:18:21] moritzm: keyholder seems to be unhappy in cumin1001 [10:18:56] oh I just saw -sre, sorry [10:20:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [10:20:47] (03CR) 10Jcrespo: "Deploying as is- this doesn't fix all issues, but it is not a bad thing to merge." [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [10:20:51] (03CR) 10Jcrespo: [C: 03+2] Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [10:20:58] (03CR) 10Jcrespo: [C: 03+2] Transferer: Enable PBKDF2 usage with 310000 iterations [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo) [10:21:05] (03CR) 10Jcrespo: [C: 03+2] Update changelog for release 1.1 [software/transferpy] - 10https://gerrit.wikimedia.org/r/859446 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo) [10:21:38] (03CR) 10Jcrespo: ""man transferpy" FYI" [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 (owner: 10Jcrespo) [10:21:43] (03CR) 10Jcrespo: [C: 03+2] Add man page for transfer.py executable [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 (owner: 10Jcrespo) [10:23:09] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10Volans) [10:23:17] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - 
https://phabricator.wikimedia.org/T306661 (10Volans) a:05Volans→03None [10:27:28] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [10:31:13] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 40 and 3085 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:31:43] (03CR) 10Btullis: [C: 03+1] "I'm a bit late to the party, but thanks elukey. Looks good." [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:33:49] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1015832 and 3241 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:37:23] (03CR) 10David Caro: [C: 03+1] openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859451 (owner: 10Arturo Borrero Gonzalez) [10:37:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:37:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:38:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T321126)', diff saved to 
https://phabricator.wikimedia.org/P40745 and previous config saved to /var/cache/conftool/dbconfig/20221123-103805-marostegui.json [10:38:06] (03CR) 10David Caro: [C: 03+2] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [10:38:11] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:38:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859451 (owner: 10Arturo Borrero Gonzalez) [10:39:56] (03CR) 10Marostegui: [C: 03+2] db1133: Move it to test-s4 section [puppet] - 10https://gerrit.wikimedia.org/r/859972 (https://phabricator.wikimedia.org/T322993) (owner: 10Marostegui) [10:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40746 and previous config saved to /var/cache/conftool/dbconfig/20221123-104023-marostegui.json [10:42:30] (03PS1) 10Kosta Harlan: [WIP] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [10:42:55] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:24] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[10:46:41] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:47:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:48:17] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:48:17] (03PS1) 10Jbond: swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) [10:49:05] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:50:48] (03PS2) 10Jbond: swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) [10:51:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38403/console" [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:28] this is due to a deployment --^ [10:52:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:54:05] (03PS2) 10Kosta Harlan: [WIP] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [10:54:07] (03PS1) 10Kosta Harlan: 
GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 [10:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40747 and previous config saved to /var/cache/conftool/dbconfig/20221123-105529-marostegui.json [10:56:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:02] (03CR) 10Giuseppe Lavagetto: "To check the reassigning logic, please see" [puppet] - 10https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto) [10:57:59] (03CR) 10Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [10:58:06] (03PS2) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) [10:59:32] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, can we deploy the releases right away or do we need to wait on the hosts actually being in production?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [11:01:51] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) p:05Triage→03Medium [11:01:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:20] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1047: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:05:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:06:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1047: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:06:52] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [11:07:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with O... 
[11:09:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: Add mw-web, mw-api-ext [dns] - 10https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:09:55] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40748 and previous config saved to /var/cache/conftool/dbconfig/20221123-111036-marostegui.json [11:10:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:13] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:48] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Test - volans@cumin1001" [11:13:05] 10SRE, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (10Vgutierrez) @legoktm it looks like the easiest approach would be adding lists1001 as a backend server on ATS and set the caching policy to `pass`. Under this scenario, lists.wikimed... 
[11:13:23] (03PS3) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) [11:13:41] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10akosiaris) Hey everyone, I think this discussion would benefit greatly from a higher bandwidth venue than phabricator. It's quite clear there are pain points regarding t... [11:14:13] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Test - volans@cumin1001" [11:14:46] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Add mw-web, mw-api-ext [dns] - 10https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:15:03] !log Adding mw-web and mw-api-ext to wmnet dns [11:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:22] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:16:40] Hold up, that change isn't actually good, fixing. [11:16:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is mediawiki, so we need a bit more refinement around discovery records." 
[puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:16:53] (03PS1) 10Cathal Mooney: Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635) [11:17:00] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:18:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::catalog: Add mw-web and mw-api-ext (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:18:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::catalog: Add mw-web and mw-api-ext (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:19:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:19:17] (03PS2) 10Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) [11:19:19] (03PS1) 10Clément Goubert: wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621) [11:19:25] (03CR) 10Cathal Mooney: [C: 03+2] Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:19:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 
(https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:20:13] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage [11:20:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:20:54] (03CR) 10Jbond: [C: 03+2] swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:21:00] (03Merged) 10jenkins-bot: Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:21:48] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:22:50] !log authdns-update for mw-web and mw-api-ext [11:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage [11:23:38] (03CR) 10David Caro: [C: 03+1] "LGTM, looking a bit into it, will it use this feature? 
(from https://libvirt.org/manpages/virtlogd.html)" [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:24:38] !log changing port-speed configuration syntax on asw1-b12-drmrs [11:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40750 and previous config saved to /var/cache/conftool/dbconfig/20221123-112542-marostegui.json [11:25:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:25:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:25:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:26:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40751 and previous config saved to /var/cache/conftool/dbconfig/20221123-112604-marostegui.json [11:26:56] (03CR) 10Jelto: "This change is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [11:28:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40752 and previous config saved to /var/cache/conftool/dbconfig/20221123-112821-marostegui.json [11:28:41] (03CR) 10Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:30:13] (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995 (owner: 10Kosta Harlan) [11:30:17] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 14379970856 and 1048 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:33:51] (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:34:38] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:36:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [11:36:19] PROBLEM - Check systemd state on cp5020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:19] (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (031 comment) 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:37:37] (03PS1) 10Jbond: puppetdb: add cpu_flags back to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/860006 [11:38:56] (03CR) 10Jbond: [C: 03+2] puppetdb: add cpu_flags back to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/860006 (owner: 10Jbond) [11:39:17] (03PS5) 10Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) [11:39:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [11:40:12] (03CR) 10Clément Goubert: service::catalog: Add mw-web and mw-api-ext (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:42:22] jouncebot: nowandnext [11:42:22] No deployments scheduled for the next 2 hour(s) and 17 minute(s) [11:42:23] In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1400) [11:42:36] jbond: ^^ are you playing with cp5020? 
[11:42:37] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:42:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [11:42:44] I’ll deploy a security patch if nobody objects [11:43:24] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38404/console" [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:43:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40753 and previous config saved to /var/cache/conftool/dbconfig/20221123-114327-marostegui.json [11:44:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:44:09] https://www.irccloud.com/pastebin/vT1kDXHp/ [11:44:17] (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:45:10] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:45:12] jbond: ^^ random error? 
[11:45:17] (03PS3) 10Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) [11:46:04] (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [11:46:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [11:46:45] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0236 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:49:17] this was me, it should clear soon [11:50:37] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:50:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Active/passive records have a different type." 
[dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:51:17] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36124399688 and 1634 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:52:19] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0004914 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:52:42] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1047.eqiad.wmnet with OS bullseye [11:52:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu... [11:53:04] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu... 
[11:55:10] !log updating mw canaries to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358 [11:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] (03PS4) 10Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) [11:56:55] (03PS10) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [11:57:21] (03CR) 10CI reject: [V: 04-1] mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:57:23] (03CR) 10Clément Goubert: mw-web, mw-api-ext: add discovery records (034 comments) [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [11:57:27] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:57:54] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [11:58:09] I’m deploying my patch now ftr [11:58:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40754 and previous config saved to /var/cache/conftool/dbconfig/20221123-115834-marostegui.json [12:01:20] (03CR) 10David Caro: [C: 03+2] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [12:01:43] !log lucaswerkmeister-wmde: Deployed security patch for T323592 [12:02:18] (03PS2) 10Giuseppe Lavagetto: Add conversion for ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/859567 [12:03:32] * Lucas_WMDE done [12:04:27] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1046: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184) [12:04:34] (03PS3) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) [12:04:36] (03PS3) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) [12:04:38] (03PS3) 10Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) [12:04:42] (03Merged) 10jenkins-bot: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [12:05:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-debug: apply [12:05:50] (03CR) 10Clément Goubert: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [12:06:12] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38405/console" [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [12:06:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:06:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:07:17] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38406/console" [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [12:07:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:09:36] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13384325296 and 1123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:09:37] (03PS16) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [12:10:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [12:10:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:11:00] (03CR) 
10Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [12:13:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40755 and previous config saved to /var/cache/conftool/dbconfig/20221123-121340-marostegui.json [12:13:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:13:47] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:13:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40756 and previous config saved to /var/cache/conftool/dbconfig/20221123-121402-marostegui.json [12:16:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40758 and previous config saved to /var/cache/conftool/dbconfig/20221123-121618-marostegui.json [12:17:45] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991) [12:18:14] (03CR) 10Hnowlan: [C: 03+1] "LGTM. 
Context from bpirkle:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto) [12:18:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:18:30] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1046: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:18:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1046: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:19:08] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mnz-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bullseye [12:19:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1046.eqiad.wmnet with O... 
[12:20:50] RECOVERY - Check systemd state on cp5020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:58] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38631679832 and 2361 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:22:16] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2904 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:24:04] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:10] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:25:14] (03CR) 10Jbond: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [12:26:06] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858659 (https://phabricator.wikimedia.org/T306200) (owner: 10Andrew Bogott) [12:26:39] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [12:28:00] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 310 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:13] (03PS4) 10David Caro: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) 
[12:30:20] (03Abandoned) 10Jbond: puppet_compiler: drop yaml dir from export facts tar ball [puppet] - 10https://gerrit.wikimedia.org/r/745990 (owner: 10Jbond) [12:31:20] (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [12:31:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40759 and previous config saved to /var/cache/conftool/dbconfig/20221123-123125-marostegui.json [12:31:34] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 525 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:15] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621) [12:32:20] T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 [12:32:37] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [12:32:42] !log restarting pybal on lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet for mw-web and mw-api-ext behind LVS T323621 [12:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621) [12:34:20] (03CR) 10Filippo Giunchedi: Add new graphite hosts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [12:36:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host 
reimage [12:36:45] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [12:39:43] (03PS1) 10Clément Goubert: Fix mw-api-ext eqiad service ip [puppet] - 10https://gerrit.wikimedia.org/r/860015 [12:40:55] (03PS2) 10Clément Goubert: service::catalog: Fix mw-api-ext eqiad service ip [puppet] - 10https://gerrit.wikimedia.org/r/860015 [12:43:13] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet [12:44:26] (03CR) 10Clément Goubert: [C: 03+2] service::catalog: Fix mw-api-ext eqiad service ip [puppet] - 10https://gerrit.wikimedia.org/r/860015 (owner: 10Clément Goubert) [12:45:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:46:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40760 and previous config saved to /var/cache/conftool/dbconfig/20221123-124631-marostegui.json [12:48:09] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [12:49:08] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621) [12:49:13] T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 [12:50:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:16] (03CR) 10Muehlenhoff: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [12:50:34] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [12:52:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621) [12:54:42] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:58] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs (T323621) [12:56:04] T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 [12:56:24] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:58:15] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:58:34] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs (T323621) [12:59:21] (03PS1) 10Muehlenhoff: Add logind to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991) [12:59:33] (03PS1) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 
[13:01:34] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [13:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40761 and previous config saved to /var/cache/conftool/dbconfig/20221123-130138-marostegui.json [13:01:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [13:01:45] (03CR) 10CI reject: [V: 04-1] spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [13:01:45] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:01:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [13:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40762 and previous config saved to /var/cache/conftool/dbconfig/20221123-130159-marostegui.json [13:02:23] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS bullseye [13:02:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bu... 
[13:02:41] (03PS4) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) [13:02:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:07:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:52] (03PS1) 10Slyngshede: WIP: Signup and LDAP flow. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [13:17:34] (03PS2) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 [13:18:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:18:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38407/console" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [13:18:57] (03PS1) 10Jaime Nuche: scap.cfg: enable image building in production cluster [puppet] - 10https://gerrit.wikimedia.org/r/860023 [13:19:02] (03PS2) 10Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [13:20:34] (03PS3) 10Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [13:23:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [13:24:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:25:16] !log installing apache security updates on mw canaries [13:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [13:27:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:56] (03CR) 10Slyngshede: "Would like feedback on general direction or obvious oversights." [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 (owner: 10Slyngshede) [13:31:28] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [13:32:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:00] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1045: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184) [13:39:16] !log updating mw canaries to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358 [13:39:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:35] (03PS4) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:52:38] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:52:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1045: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:53:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bullseye [13:53:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1045.eqiad.wmnet with O... [13:54:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [13:56:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [13:57:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C [13:58:27] (03PS5) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:11] o/ [14:00:18] yup, nothing in the calendar [14:01:28] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [14:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40763 and previous config saved to /var/cache/conftool/dbconfig/20221123-140215-marostegui.json [14:02:22] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:02:37] (03PS6) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [14:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:05:46] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:06:32] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:48] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [14:06:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [14:07:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [14:07:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [14:07:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40764 and previous config saved to /var/cache/conftool/dbconfig/20221123-140712-ladsgroup.json [14:07:22] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:07:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [14:07:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40765 and previous config saved to /var/cache/conftool/dbconfig/20221123-140732-ladsgroup.json [14:07:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C [14:08:22] (03CR) 
10Clément Goubert: [C: 03+2] service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [14:08:47] (03CR) 10Volans: "The only concern I have with this approach is that the other SREs seeing the systemd alert would not know what to do and why it's alerting" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [14:10:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [14:12:11] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:20] (03PS7) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [14:14:01] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro [14:14:21] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro [14:14:52] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mw-web [14:15:12] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mw-api-ext [14:15:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:15:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40766 and previous config saved to /var/cache/conftool/dbconfig/20221123-141543-ladsgroup.json [14:15:45] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: add discovery records [dns] - 
10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [14:15:47] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:15:48] (03CR) 10CI reject: [V: 04-1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 (owner: 10Arturo Borrero Gonzalez) [14:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40767 and previous config saved to /var/cache/conftool/dbconfig/20221123-141722-marostegui.json [14:18:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] scap.cfg: enable image building in production cluster [puppet] - 10https://gerrit.wikimedia.org/r/860023 (owner: 10Jaime Nuche) [14:19:22] (03PS4) 10Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) [14:19:59] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: httpbb random read timeout on cumin2002 - https://phabricator.wikimedia.org/T323707 (10Volans) p:05Triage→03Medium [14:20:42] (03CR) 10Elukey: [C: 03+1] "LGTM! Added also Ben and Steve so we can get the green light from the DE folks as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [14:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:22:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40768 and previous config saved to /var/cache/conftool/dbconfig/20221123-142159-ladsgroup.json [14:22:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto) [14:23:58] (03CR) 10Muehlenhoff: [C: 03+2] Add logind to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:24:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:18] (03PS5) 10Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) [14:24:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:25:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:05] RECOVERY - mailman archives on lists1001 is 
OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:45] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38409/console" [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [14:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:27:11] (03Merged) 10jenkins-bot: remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto) [14:28:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I hate this script so, so much 😄" [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [14:28:35] (03CR) 10Clément Goubert: [C: 03+1] scap.cfg: enable image building in production cluster [puppet] - 10https://gerrit.wikimedia.org/r/860023 (owner: 10Jaime Nuche) [14:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:02] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:31:08] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [14:32:21] (03CR) 10Btullis: [C: 03+1] "Thanks. Looks good to me. Will be on the lookout for any changes to behaviour, but don't anticipate anything." [puppet] - 10https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [14:32:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40769 and previous config saved to /var/cache/conftool/dbconfig/20221123-143228-marostegui.json [14:32:49] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:33:30] (03CR) 10Btullis: [C: 03+1] "LGTM,thanks." [puppet] - 10https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:36:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS bullseye [14:36:31] (03CR) 10Marostegui: [C: 03+1] Add Cumin alias for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/857017 (owner: 10Muehlenhoff) [14:36:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1045.eqiad.wmnet with OS bu... 
[14:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40770 and previous config saved to /var/cache/conftool/dbconfig/20221123-143706-ladsgroup.json [14:40:28] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [14:41:32] !log rebalance Ganeti group B/eqiad T311687 [14:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:38] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:43:42] RECOVERY - Ganeti memory on ganeti1015 is OK: OK Memory 83% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:44:43] (03CR) 10Vgutierrez: [C: 03+2] archiva: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [14:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40771 and previous config saved to /var/cache/conftool/dbconfig/20221123-144735-marostegui.json [14:47:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:49:41] (03PS1) 10Ssingh: hiera: lvs4007: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/860057 [14:49:58] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [14:52:11] ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23 [14:52:11] ng [14:52:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40772 and previous config saved to /var/cache/conftool/dbconfig/20221123-145212-ladsgroup.json [14:54:09] (03PS8) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [14:54:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:54:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:54:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:54:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40773 and previous config saved to /var/cache/conftool/dbconfig/20221123-145446-marostegui.json [14:54:52] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 
[14:56:18] RECOVERY - graphite.wikimedia.org requires authentication on graphite2004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 548 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40774 and previous config saved to /var/cache/conftool/dbconfig/20221123-145701-marostegui.json [14:59:50] (03PS5) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) [15:01:27] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [15:01:37] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) 05In progress→03Resolved [15:01:45] (03PS9) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [15:03:45] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [15:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132 Maint', diff saved to https://phabricator.wikimedia.org/P40775 and previous config saved to /var/cache/conftool/dbconfig/20221123-150621-ladsgroup.json [15:06:58] (03CR) 10Clément Goubert: [C: 03+2] Add new graphite hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:07:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40776 and previous 
config saved to /var/cache/conftool/dbconfig/20221123-150719-ladsgroup.json [15:08:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:08:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:09:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:09:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:10:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [15:10:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [15:10:45] !log deploying change 859575 on mw-* wikikube deployments [15:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:20] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:12:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40777 and previous config saved to /var/cache/conftool/dbconfig/20221123-151207-marostegui.json [15:13:20] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P40778 and previous config saved to /var/cache/conftool/dbconfig/20221123-151507-ladsgroup.json [15:15:18] !log updating snapshot* hosts to PHP 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358 [15:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:45] !log 
pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:17:51] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [15:20:10] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:20:53] godog: I deployed your graphite change to wikikube [15:21:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 (owner: 10Giuseppe Lavagetto) [15:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:22:26] <_joe_> uh sigh [15:25:20] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:26:09] (03Merged) 10jenkins-bot: image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 (owner: 10Giuseppe Lavagetto) [15:26:32] jouncebot: now [15:26:32] No deployments scheduled for the next 3 hour(s) and 33 minute(s) [15:26:35] Ace [15:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', 
diff saved to https://phabricator.wikimedia.org/P40779 and previous config saved to /var/cache/conftool/dbconfig/20221123-152714-marostegui.json [15:28:48] !log jforrester@deploy1002 Started deploy [integration/docroot@52e4a00]: Deploying 52e4a00 for T311097 pointing Codex docs to latest [15:28:54] T311097: docs: Consider making the latest release branch the default for the live docs site - https://phabricator.wikimedia.org/T311097 [15:29:03] !log jforrester@deploy1002 Finished deploy [integration/docroot@52e4a00]: Deploying 52e4a00 for T311097 pointing Codex docs to latest (duration: 00m 14s) [15:29:53] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [15:30:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P40780 and previous config saved to /var/cache/conftool/dbconfig/20221123-153012-ladsgroup.json [15:30:16] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [15:31:05] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [15:35:55] (03PS1) 10Ssingh: lvs4010: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247) [15:35:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [15:36:18] claime: amazing! 
thank you so much <3 [15:36:33] godog: np <3 [15:37:27] (03CR) 10Hnowlan: [C: 03+2] api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [15:38:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40782 and previous config saved to /var/cache/conftool/dbconfig/20221123-153824-ladsgroup.json [15:38:31] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:40:59] (03PS1) 10Arturo Borrero Gonzalez: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 [15:41:45] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [15:41:46] !log btullis@cumin2002 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[15:42:17] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:42:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40783 and previous config saved to /var/cache/conftool/dbconfig/20221123-154220-marostegui.json [15:42:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:42:27] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:42:30] (03Merged) 10jenkins-bot: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [15:42:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:42:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40784 and previous config saved to /var/cache/conftool/dbconfig/20221123-154242-marostegui.json [15:42:44] (03PS2) 10Arturo Borrero Gonzalez: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 [15:44:33] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating for lvs4009 and lvs4010 - sukhe@cumin2002" [15:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40785 and previous config saved to /var/cache/conftool/dbconfig/20221123-154459-marostegui.json [15:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P40786 and previous config saved to 
/var/cache/conftool/dbconfig/20221123-154517-ladsgroup.json [15:45:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating for lvs4009 and lvs4010 - sukhe@cumin2002" [15:45:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [15:48:20] (03PS28) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [15:48:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40787 and previous config saved to /var/cache/conftool/dbconfig/20221123-154831-ladsgroup.json [15:48:38] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:49:28] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:50:14] (03PS2) 10Filippo Giunchedi: hiera: replace graphite2003 with 2004 for graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) [15:51:51] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [15:52:00] (03CR) 10Muehlenhoff: install_server: Add dynamic raid configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [15:52:08] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38410/console" [puppet] - 10https://gerrit.wikimedia.org/r/860057 (owner: 10Ssingh) [15:52:11] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MoritzMuehlenhoff) >>! In T308677#8346238, @jbond wrote: > The underlining is... 
[15:52:15] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [15:52:20] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [15:52:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [15:53:10] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [15:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P40788 and previous config saved to /var/cache/conftool/dbconfig/20221123-155330-ladsgroup.json [15:55:16] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38412/console" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:56:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] hiera: lvs4007: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/860057 (owner: 10Ssingh) [15:57:42] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hiera: replace graphite2003 with 2004 for graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:58:05] (03PS2) 10Jforrester: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 [15:58:07] (03CR) 10Jforrester: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 (owner: 10Jforrester) [15:58:25] (03CR) 10Vgutierrez: [C: 03+1] lvs4010: commission new LVS host (ulsfo hardware 
refresh) [puppet] - 10https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:59:04] (03PS3) 10Raymond Ndibe: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez) [15:59:40] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:00:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40789 and previous config saved to /var/cache/conftool/dbconfig/20221123-160005-marostegui.json [16:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P40790 and previous config saved to /var/cache/conftool/dbconfig/20221123-160022-ladsgroup.json [16:01:57] (03CR) 10Raymond Ndibe: [C: 03+2] tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez) [16:02:42] (03Merged) 10jenkins-bot: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez) [16:03:16] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [16:03:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) 05Resolved→03Open @BCornwall reopening I can not find myself on [[ https://ldap.toolforge.org/ | this list ]] and and not able to log into... 
[16:03:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P40791 and previous config saved to /var/cache/conftool/dbconfig/20221123-160338-ladsgroup.json [16:06:51] (03CR) 10Jelto: "This change is ready for review." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:07:01] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [16:08:07] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [16:08:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [16:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P40792 and previous config saved to /var/cache/conftool/dbconfig/20221123-160837-ladsgroup.json [16:09:00] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:09:19] (03PS29) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [16:09:34] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:10:34] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:11:02] (03PS1) 10Filippo Giunchedi: Remove graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718) [16:12:26] (03CR) 10Clément 
Goubert: [C: 03+2] scap.cfg: enable image building in production cluster [puppet] - 10https://gerrit.wikimedia.org/r/860023 (owner: 10Jaime Nuche) [16:13:07] (03PS1) 10Hnowlan: thumbor: lower memory limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) [16:13:19] (03PS1) 10Filippo Giunchedi: wmnet: replace graphite2003 with graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/860073 (https://phabricator.wikimedia.org/T315524) [16:13:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [16:15:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40793 and previous config saved to /var/cache/conftool/dbconfig/20221123-161512-marostegui.json [16:16:16] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:16:22] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:51] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:17:20] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[2001-2004].codfw.wmnet,aqs[1010-1015].eqiad.wmnet: T314309 restarting to pick up new JRE - eevans@cumin1001 [16:17:53] (03PS3) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 [16:17:55] (03PS1) 10Jbond: 
systemd::timer::job: update documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/860074 [16:17:57] (03PS1) 10Jbond: systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075 [16:18:17] (03CR) 10Muehlenhoff: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:18:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P40794 and previous config saved to /var/cache/conftool/dbconfig/20221123-161844-ladsgroup.json [16:20:02] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [16:21:55] (03CR) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond) [16:22:09] (03CR) 10Filippo Giunchedi: [C: 03+2] "comment-only, self-merging" [dns] - 10https://gerrit.wikimedia.org/r/860073 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [16:23:31] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40795 and previous config saved to /var/cache/conftool/dbconfig/20221123-162345-ladsgroup.json [16:23:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:23:51] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:24:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:24:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40796 and previous config saved to /var/cache/conftool/dbconfig/20221123-162407-ladsgroup.json
[16:29:53] (CR) Filippo Giunchedi: "LGTM, though I think we should change the default value to the link you posted:" [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:30:15] (CR) Filippo Giunchedi: [C: +1] systemd::timer::job: add monitoring_url to unit file [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40797 and previous config saved to /var/cache/conftool/dbconfig/20221123-163018-marostegui.json
[16:30:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:30:22] (CR) Filippo Giunchedi: systemd::timer::job: add monitoring_url to unit file [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:30:25] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[16:30:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:30:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:30:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1174.eqiad.wmnet with
reason: Maintenance
[16:31:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:31:13] (CR) Clément Goubert: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[16:31:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40798 and previous config saved to /var/cache/conftool/dbconfig/20221123-163115-marostegui.json
[16:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40799 and previous config saved to /var/cache/conftool/dbconfig/20221123-163330-marostegui.json
[16:33:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40800 and previous config saved to /var/cache/conftool/dbconfig/20221123-163351-ladsgroup.json
[16:33:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:33:57] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:34:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:34:11] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint1002']
[16:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40801 and previous config saved to /var/cache/conftool/dbconfig/20221123-163412-ladsgroup.json
[16:35:09] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q1:rack/setup/install contint1002 -
https://phabricator.wikimedia.org/T313830 (Papaul)
[16:35:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:37:17] (PS6) Jbond: install_server: Add dynamic raid configuration [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[16:40:22] (CR) Hnowlan: [C: +2] thumbor: lower memory limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[16:40:26] (CR) Jbond: "updated" [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:40:46] (PS2) Jbond: systemd::timer::job: add monitoring_url to unit file [puppet] - https://gerrit.wikimedia.org/r/860075
[16:40:48] (PS1) Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661)
[16:40:50] (PS1) Volans: WIP (to be modified) [cookbooks] - https://gerrit.wikimedia.org/r/860081
[16:42:30] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38413/console" [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:42:31] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[16:43:44] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[16:45:17] (Merged) jenkins-bot: thumbor: lower memory limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[16:45:49] !log oblivian@deploy1002 helmfile [eqiad] START
helmfile.d/services/image-suggestion: apply
[16:46:46] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[16:48:05] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (Jcross) Approved
[16:48:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Jcross) Approved
[16:48:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40802 and previous config saved to /var/cache/conftool/dbconfig/20221123-164837-marostegui.json
[16:49:27] (CR) Filippo Giunchedi: [C: +1] "Ship it!" [puppet] - https://gerrit.wikimedia.org/r/860075 (owner: Jbond)
[16:49:58] (PS2) Giuseppe Lavagetto: proton: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859486
[16:51:01] (CR) Filippo Giunchedi: [C: +1] spicerack: add monitoring for sre.puppet.netbox-sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[16:51:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:52:22] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:53:01] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[16:55:50] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad -
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:56:14] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['contint1002']
[16:56:52] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:57:24] (PS10) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[16:57:52] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:58:31] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul)
[17:02:56] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (nskaggs) It's exciting to see so many successful transitions to single NIC here already! Great work! However, I also want to ask tha...
[17:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40803 and previous config saved to /var/cache/conftool/dbconfig/20221123-170343-marostegui.json
[17:09:44] (CR) Giuseppe Lavagetto: [C: +2] proton: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859486 (owner: Giuseppe Lavagetto)
[17:12:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:13:28] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:14:37] (Merged) jenkins-bot: proton: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859486 (owner: Giuseppe Lavagetto)
[17:16:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for arclamp1001 - pt1979@cumin2002"
[17:18:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for arclamp1001 - pt1979@cumin2002"
[17:18:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad -
https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:18:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40804 and previous config saved to /var/cache/conftool/dbconfig/20221123-171850-marostegui.json
[17:18:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:18:56] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[17:19:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40805 and previous config saved to /var/cache/conftool/dbconfig/20221123-171911-marostegui.json
[17:21:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40806 and previous config saved to /var/cache/conftool/dbconfig/20221123-172128-marostegui.json
[17:21:39] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[17:22:09] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[17:24:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:27:11] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[17:32:12] (CR) Hashar: "Looking on gerrit1001.wikimedia.org in
/var/log/apache2/gerrit.wikimedia.org.http.access.log there are only a few requests:" [puppet] - https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[17:33:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[2001-2004].codfw.wmnet,aqs[1010-1015].eqiad.wmnet: T314309 restarting to pick up new JRE - eevans@cumin1001
[17:34:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:36:11] (CR) Dzahn: "ACK, I'm starting to feel this one causes more trouble than it fixes.hmmm" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn)
[17:36:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40807 and previous config saved to /var/cache/conftool/dbconfig/20221123-173635-marostegui.json
[17:36:55] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:37:20] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:39:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:39:57] !log initiating Cassandra bootstrap, aqs1018-a -- T307802
[17:40:02] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:03] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802
[17:41:46] RECOVERY - cassandra-b SSL 10.64.32.31:7001 on aqs1018 is OK: SSL OK - Certificate aqs1018-b valid until 2024-11-08 15:06:27 +0000 (expires in 715 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:42:20] RECOVERY - cassandra-b service on aqs1018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:42:33] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:42:41] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020
[17:42:47] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[17:44:06] !log [Elastic] T319020 Kicked off rolling restart of cloudelastic to apply new heap size 8->10G; see `ryankemper@cumin1001` tmux session `cloudelastic_restarts`
[17:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40809 and previous config saved to /var/cache/conftool/dbconfig/20221123-175141-marostegui.json
[17:56:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40810 and previous config saved to /var/cache/conftool/dbconfig/20221123-175625-ladsgroup.json
[17:56:32] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:56:35]
(KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:57:04] (PS15) Dzahn: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260
[17:58:08] (PS1) Reedy: Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236)
[17:58:13] (PS1) Hashar: eslint: switch to es2018 [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/860086
[17:59:54] (CR) Hashar: "Gerrit has a few JavaScript plugins in ./plugins which are passed through eslint. I found out I could use the little bit modern es2018 whe" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/860086 (owner: Hashar)
[18:00:14] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (MNadrofsky) @BTullis I approve this for @gmodena . With Will currently away, I'm acting manager for Gabriele. Let me know if you need anything else!
[18:00:44] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[18:01:25] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED
[18:01:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:01:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[18:02:59] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[18:03:01] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[18:03:11] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[18:04:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[18:06:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40812 and previous config saved to /var/cache/conftool/dbconfig/20221123-180648-marostegui.json
[18:06:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[18:06:55] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:07:01] (PS2) Hashar: eslint: switch to es2018 [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/860086
[18:07:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with
reason: Maintenance
[18:07:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40813 and previous config saved to /var/cache/conftool/dbconfig/20221123-180709-marostegui.json
[18:07:59] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[18:08:32] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[18:09:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40814 and previous config saved to /var/cache/conftool/dbconfig/20221123-180924-marostegui.json
[18:10:43] (CR) Hashar: eslint: switch to es2018 (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/860086 (owner: Hashar)
[18:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P40815 and previous config saved to /var/cache/conftool/dbconfig/20221123-181132-ladsgroup.json
[18:12:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020
[18:12:22] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[18:16:37] SRE, ops-eqiad, Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (Jclark-ctr) Open→Resolved Replaced cable Error has cleared
[18:17:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration -
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:18:02] (PS7) Jbond: install_server: Add dynamic raid configuration [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[18:18:36] RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[18:22:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40816 and previous config saved to /var/cache/conftool/dbconfig/20221123-182220-ladsgroup.json
[18:22:27] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:22:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40817 and previous config saved to /var/cache/conftool/dbconfig/20221123-182431-marostegui.json
[18:26:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P40818 and previous config saved to /var/cache/conftool/dbconfig/20221123-182638-ladsgroup.json
[18:30:36] (CR) Volans: "reply inline" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[18:36:24] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED
[18:37:05] !log pt1979@cumin1001 START - Cookbook
sre.hosts.provision for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED
[18:37:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P40819 and previous config saved to /var/cache/conftool/dbconfig/20221123-183726-ladsgroup.json
[18:38:26] (CR) Jbond: install_server: Add dynamic raid configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[18:38:30] (CR) Jbond: [C: +2] install_server: Add dynamic raid configuration [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[18:38:51] (PS5) Vlad.shapik: WP:Add ability to specify a DPI value for PDF [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959)
[18:39:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:39:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40820 and previous config saved to /var/cache/conftool/dbconfig/20221123-183937-marostegui.json
[18:39:40] (CR) Ssingh: [C: +2] hiera: lvs4007: bump bgp_med to 150 [puppet] - https://gerrit.wikimedia.org/r/860057 (owner: Ssingh)
[18:41:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host arclamp1001.mgmt.eqiad.wmnet with reboot policy FORCED
[18:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40821 and previous config saved to
/var/cache/conftool/dbconfig/20221123-184145-ladsgroup.json
[18:41:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:41:51] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:42:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40822 and previous config saved to /var/cache/conftool/dbconfig/20221123-184207-ladsgroup.json
[18:42:23] !log restart pybal on lvs4007.ulsfo.wmnet
[18:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:46] (PS1) Ssingh: sites.yaml: add lvs4010 (ulsfo hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/860089 (https://phabricator.wikimedia.org/T317247)
[18:44:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:44:19] (CR) Ssingh: [C: +2] lvs4010: commission new LVS host (ulsfo hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[18:44:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:45:25] !log sukhe@cumin2002 START -
Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS buster
[18:45:33] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster
[18:51:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host arclamp1001.mgmt.eqiad.wmnet with reboot policy FORCED
[18:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P40823 and previous config saved to /var/cache/conftool/dbconfig/20221123-185233-ladsgroup.json
[18:53:05] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[18:53:11] SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[18:54:17] (PS1) Papaul: Add contint1002 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/860093 (https://phabricator.wikimedia.org/T313830)
[18:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40824 and previous config saved to /var/cache/conftool/dbconfig/20221123-185444-marostegui.json
[18:54:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:54:50] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:54:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:55:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['arclamp1001']
[18:55:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40825 and previous config saved to /var/cache/conftool/dbconfig/20221123-185505-marostegui.json
[18:56:11] !log btullis@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[18:58:02] (CR) Dzahn: [V: +1] "compiles now with no diff:) https://puppet-compiler.wmflabs.org/output/852260/38414/" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn)
[18:59:07] (CR) Dzahn: "Would be great if this could be done with approval of serviceops-core team."
[puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[18:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40826 and previous config saved to /var/cache/conftool/dbconfig/20221123-185920-marostegui.json
[18:59:38] (CR) Dzahn: [V: -1] "to be merged during next migration window" [puppet] - https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[18:59:41] (CR) Papaul: [C: +2] Add contint1002 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/860093 (https://phabricator.wikimedia.org/T313830) (owner: Papaul)
[19:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1900)
[19:02:58] SRE, ops-eqiad, DC-Ops, serviceops-collab, Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Dzahn) >>! In T313830#8366381, @hashar wrote: > Given this task to replace contint1001, its IPv4 address can be reclaimed once the migration has complete...
[19:03:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) [19:03:54] (03PS1) 10Ssingh: lvs4010: set as secondary LVS and remove lvs4007 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/860094 (https://phabricator.wikimedia.org/T317247) [19:04:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab, 10Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul) [19:04:40] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED [19:05:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['arclamp1001'] [19:05:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [19:06:17] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED [19:06:44] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:06:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul) [19:07:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) [19:07:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323214)', diff saved to 
https://phabricator.wikimedia.org/P40827 and previous config saved to /var/cache/conftool/dbconfig/20221123-190739-ladsgroup.json [19:07:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:07:46] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:07:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40828 and previous config saved to /var/cache/conftool/dbconfig/20221123-190812-ladsgroup.json [19:09:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [19:09:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host contint1002.wikimedia.org with OS buster [19:09:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab, 10Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster [19:11:52] (03PS1) 10Jdlrobson: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T322041) [19:13:21] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [19:13:27] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [19:14:09] (03PS2) 10Jdlrobson: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) [19:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40829 and previous config saved to /var/cache/conftool/dbconfig/20221123-191427-marostegui.json [19:15:27] (03PS1) 10Papaul: Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) [19:16:03] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [19:16:05] (03CR) 10CI reject: [V: 04-1] Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) (owner: 10Papaul) [19:16:10] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[19:18:39] (03PS2) 10Papaul: Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) [19:20:12] (03CR) 10Papaul: [C: 03+2] Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) (owner: 10Papaul) [19:21:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1002.wikimedia.org with reason: host reimage [19:24:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1002.wikimedia.org with reason: host reimage [19:26:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS buster [19:26:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster completed: - lvs4010 (**... 
[19:28:04] (03CR) 10Ryan Kemper: elastic: change java GC options to default for ES7 (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [19:28:21] (03CR) 10Ryan Kemper: "(Had forgotten to publish draft comments so just published them)" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [19:29:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye [19:29:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye [19:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40830 and previous config saved to /var/cache/conftool/dbconfig/20221123-192934-marostegui.json [19:32:49] (03PS1) 10Jbond: install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101 [19:33:26] (03PS1) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [19:34:04] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [19:34:55] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:11] (03PS2) 10Jbond: install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101 [19:35:46] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:47] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs4010 (ulsfo hardware refresh) 
[homer/public] - 10https://gerrit.wikimedia.org/r/860089 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [19:37:08] !log running homer for Gerrit: 860089 [19:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:47] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [19:37:53] !log phab1004 - re-enabling puppet - phd should stay stopped, dumps and logmail should keep running [19:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:33] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [19:38:59] !log [done] running homer for Gerrit: 860089 [19:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1002.wikimedia.org with OS buster [19:39:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster completed: - contint1002 (**PASS**) -... 
[19:41:08] !log decommission lvs4007: T317247 [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:14] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [19:41:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul) [19:41:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs4007.ulsfo.wmnet [19:42:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul) 05Open→03Resolved @LSobanski this is done [19:42:31] (03CR) 10Jbond: [C: 03+2] install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101 (owner: 10Jbond) [19:43:17] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [19:43:24] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[19:44:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40831 and previous config saved to /var/cache/conftool/dbconfig/20221123-194441-marostegui.json [19:44:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:44:47] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:44:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:45:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [19:45:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [19:45:32] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:46] ^ expected [19:45:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance [19:45:49] ack [19:45:55] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [19:46:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance [19:46:01] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [19:46:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance [19:46:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance [19:46:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40832 and previous config saved to /var/cache/conftool/dbconfig/20221123-194646-marostegui.json [19:46:52] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs4007 [homer/public] - 10https://gerrit.wikimedia.org/r/860103 (https://phabricator.wikimedia.org/T317247) [19:48:11] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:49:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40833 and previous config saved to /var/cache/conftool/dbconfig/20221123-194918-marostegui.json [19:49:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10LSobanski) [19:49:52] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski) 05Stalled→03Open a:05LSobanski→03None [19:51:09] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10Dzahn) also see T313830#8418218 [19:51:35] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs4007.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:52:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install 
contint1002 - https://phabricator.wikimedia.org/T313830 (10Dzahn) If this is done, I assume the IP addresses can't have stayed the same as @Hashar was asking. But given that netbox will assign one automatically that was probably neve... [19:52:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:54:35] I have some contint changes in the dns cookbook [19:54:41] is it OK to merge those? [19:54:49] mutante: ^ since I saw your comment on the contint thing, I think :) [19:54:57] sukhe: no, it's not me :) [19:55:01] oh sorry [19:55:16] I was just wondering about related stuff [19:55:19] papaul: ^ [19:55:27] it was papaul, I just saw your last comment on IRC and assumed it was you :P [19:59:19] (03PS2) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [19:59:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs4007.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:59:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs4007.ulsfo.wmnet [19:59:45] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4007.ulsfo.wmnet` - lvs4007.ulsfo.wmnet (**WARN**) - D... 
[19:59:55] (03CR) 10jenkins-bot: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [20:00:58] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs4007 [homer/public] - 10https://gerrit.wikimedia.org/r/860103 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:01:06] (03PS3) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:01:42] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [20:02:19] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [20:02:25] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [20:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:03:19] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [20:03:26] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[20:03:39] !log running homer for Gerrit: 860103 [20:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:15] (03PS4) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:04:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40835 and previous config saved to /var/cache/conftool/dbconfig/20221123-200424-marostegui.json [20:05:00] (03CR) 10Ssingh: [C: 03+2] lvs4010: set as secondary LVS and remove lvs4007 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/860094 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [20:06:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for phab1004.eqiad.wmnet [20:06:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for phab1004.eqiad.wmnet [20:06:21] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ssingh) [20:07:26] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED [20:07:59] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [20:08:57] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [20:11:28] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718) (owner: 10Filippo Giunchedi) [20:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40836 and previous config saved to /var/cache/conftool/dbconfig/20221123-201407-ladsgroup.json [20:14:14] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:19:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40837 and previous config saved to /var/cache/conftool/dbconfig/20221123-201931-marostegui.json [20:20:06] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [20:20:29] (03PS3) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [20:20:34] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [20:20:55] (03PS4) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [20:21:43] (03PS5) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [20:21:57] (03PS6) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [20:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P40838 and previous config saved to 
/var/cache/conftool/dbconfig/20221123-202914-ladsgroup.json [20:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40839 and previous config saved to /var/cache/conftool/dbconfig/20221123-203437-marostegui.json [20:34:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:34:44] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:34:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40840 and previous config saved to /var/cache/conftool/dbconfig/20221123-203459-marostegui.json [20:37:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40841 and previous config saved to /var/cache/conftool/dbconfig/20221123-203731-marostegui.json [20:38:10] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [20:38:17] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[20:40:56] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [20:41:03] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [20:41:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host arclamp1001.eqiad.wmnet with OS bullseye [20:41:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001... [20:44:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P40842 and previous config saved to /var/cache/conftool/dbconfig/20221123-204420-ladsgroup.json [20:46:01] 10SRE, 10Tracking-Neverending: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10Aklapper) [20:48:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40843 and previous config saved to /var/cache/conftool/dbconfig/20221123-204816-ladsgroup.json [20:48:23] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:50:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul) @Dzahn yes the server has a Public IP address [20:52:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye [20:52:26] 10SRE, 
10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye [20:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40844 and previous config saved to /var/cache/conftool/dbconfig/20221123-205238-marostegui.json [20:56:14] * TheresNoTime is going to be unavailable for deploy this evening [20:56:36] (03PS5) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:57:55] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be2050.codfw.wmnet with OS bullseye [20:58:01] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[20:59:00] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [20:59:17] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [20:59:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40845 and previous config saved to /var/cache/conftool/dbconfig/20221123-205926-ladsgroup.json [20:59:27] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [20:59:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [20:59:32] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:59:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T2100). [21:00:05] cirno and jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] o/ [21:00:26] hi ! i can deploy [21:01:18] o/ [21:01:48] cirno: i'll start with your patch [21:02:11] jan_drewniak: nice to see you here :) do you want to self-deploy after i finish the current patch? 
[21:03:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P40846 and previous config saved to /var/cache/conftool/dbconfig/20221123-210322-ladsgroup.json [21:04:00] cjming: I haven't done one of these in a while, is it basically these instructions? https://deploy-commands.toolforge.org/bacc/860096 [21:04:33] jan_drewniak: ya - i'm also happy to do it for you if you prefer [21:04:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:05:26] (03Merged) 10jenkins-bot: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:05:41] !log cjming@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] [21:05:47] T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627 [21:06:13] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED [21:06:47] cjming: I don't think this patch could be tested, as beta cluster is not supported by WikimediaDebug, so can we sync directly? [21:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40848 and previous config saved to /var/cache/conftool/dbconfig/20221123-210744-marostegui.json [21:08:39] !log cjming@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. 
(duration: 02m 57s) [21:10:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:10:28] !log cjming@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] [21:10:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:10:38] cjming: actually this scap backport command is new to me, so I wanna try it out :P [21:11:11] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [21:11:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:11:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:11:40] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [21:12:09] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [21:12:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:13:55] jan_drewniak: sounds good ! 
i'll let you know when i'm done with cirno's patch [21:14:17] cirno: apologies - i'm having some issues with my account on the deployment server - just need a few mins to troubleshoot [21:16:52] !log cjming@deploy1002 sync-world aborted: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] (duration: 06m 24s) [21:16:53] !log cjming@deploy1002 backport aborted: (duration: 06m 39s) [21:16:58] T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627 [21:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:18:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:18:14] !log brennen@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] [21:18:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P40849 and previous config saved to /var/cache/conftool/dbconfig/20221123-211829-ladsgroup.json [21:19:33] !log brennen@deploy1002 brennen and stang: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:20:28] (03CR) 10RLazarus: [C: 03+1] "LGTM: balance in the spreadsheet looks good, and the CR matches the spreadsheet." 
[puppet] - 10https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto) [21:22:13] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) [21:22:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40850 and previous config saved to /var/cache/conftool/dbconfig/20221123-212250-marostegui.json [21:22:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:22:57] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:23:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:23:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:23:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:23:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40851 and previous config saved to /var/cache/conftool/dbconfig/20221123-212312-marostegui.json [21:23:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:24:43] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] (duration: 06m 29s) [21:24:49] T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627 [21:25:04] jan_drewniak: feel free to try your patch -- i'm curious if you get prompted for a sudo pw when you 
sync [21:25:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40852 and previous config saved to /var/cache/conftool/dbconfig/20221123-212543-marostegui.json [21:25:56] i got stalled by prod and need to file a ticket for my account [21:26:37] cirno: your patch is live - purging files now [21:28:12] cjming: ok I'm giving it a shot [21:28:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson) [21:29:40] (03Merged) 10jenkins-bot: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson) [21:29:53] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] [21:29:59] T323722: Deploy new logo to Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T323722 [21:30:33] cjming: yup, looks like I'm prompted for a sudo password. Not sure what to do there... [21:30:59] jan_drewniak: yeah, same glitch cjming was running into, i can go ahead and run it since it seems to work for me [21:31:14] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be2050.codfw.wmnet with OS bullseye [21:31:18] jan_drewniak: gtk - i'll file a ticket and include that you got this prompt too [21:31:20] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[21:31:26] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:29] brennen: thanks that'd be great! [21:31:32] !log jdrewniak@deploy1002 sync-world aborted: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] (duration: 01m 38s) [21:31:32] !log jdrewniak@deploy1002 backport aborted: (duration: 02m 40s) [21:31:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson) [21:31:49] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [21:31:55] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [21:31:56] !log brennen@deploy1002 Started scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] [21:32:11] I mean it does say `21:29:56 Running sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release` so... 
[21:32:43] cirno: brennen sync'd and purged your files - should be live [21:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:15] !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:33:17] cjming: confirmed, thanks [21:33:20] jan_drewniak: yeah, i think there's probably just a mismatch on group membership or something here, we'll dig in a bit [21:33:33] jan_drewniak: on text boxen... [21:33:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40853 and previous config saved to /var/cache/conftool/dbconfig/20221123-213335-ladsgroup.json [21:33:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [21:33:42] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:33:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [21:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40854 and previous config saved to /var/cache/conftool/dbconfig/20221123-213357-ladsgroup.json [21:34:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:34:07] brennen: ok, I see the change on mwdebug1002, looks good to sync [21:34:10] (03PS1) 10Jbond: install_server: fix config for ms-be dynamic partition [puppet] - 
10https://gerrit.wikimedia.org/r/860114 [21:34:18] cool, going ahead [21:34:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:34:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:35:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:35:40] (03CR) 10RLazarus: [C: 03+1] conftool: add the new servers [puppet] - 10https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto) [21:35:52] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [21:38:13] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] (duration: 06m 17s) [21:38:19] T323722: Deploy new logo to Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T323722 [21:38:24] jan_drewniak: {{done}} [21:38:34] brennen: thanks! [21:38:38] !log end of utc late backport and config window [21:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:43] (03PS1) 10Stang: wikidatawiki: Add language-specific logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860117 (https://phabricator.wikimedia.org/T323734) [21:39:42] jan_drewniak: if you want to add any other details - i mentioned you on the ticket https://phabricator.wikimedia.org/T323735 [21:40:19] cjming: thanks! 
I think that pretty much sums it up :) [21:40:31] (03CR) 10Reedy: [C: 03+2] Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236) (owner: 10Reedy) [21:40:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40855 and previous config saved to /var/cache/conftool/dbconfig/20221123-214050-marostegui.json [21:44:00] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054'] [21:44:15] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [21:45:01] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054'] [21:46:27] (03PS1) 10Dzahn: Revert "Revert "hieradata: switch active Phabricator server to phab1004"" [puppet] - 10https://gerrit.wikimedia.org/r/860031 [21:46:41] (03PS1) 10Dzahn: Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - 10https://gerrit.wikimedia.org/r/860032 [21:47:35] (03CR) 10CI reject: [V: 04-1] Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - 10https://gerrit.wikimedia.org/r/860032 (owner: 10Dzahn) [21:48:22] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [21:48:29] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[21:48:40] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [21:48:47] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [21:54:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host arclamp1001.eqiad.wmnet with OS bullseye [21:54:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001... [21:54:41] (03Merged) 10jenkins-bot: Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236) (owner: 10Reedy) [21:55:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:55:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40857 and previous config saved to /var/cache/conftool/dbconfig/20221123-215557-marostegui.json [21:56:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:56:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:57:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:59:50] !log reedy@deploy1002 Synchronized php-1.40.0-wmf.10/includes/language/Message.php: T323236 (duration: 04m 35s) [21:59:56] T323236: PHP Warning: Class RawMessage has no unserializer - 
https://phabricator.wikimedia.org/T323236 [22:02:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [22:02:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:02:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:03:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40858 and previous config saved to /var/cache/conftool/dbconfig/20221123-221103-marostegui.json [22:11:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance [22:11:10] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:11:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance [22:11:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40859 and previous config saved to /var/cache/conftool/dbconfig/20221123-221125-marostegui.json [22:13:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40860 and previous config saved to /var/cache/conftool/dbconfig/20221123-221356-marostegui.json [22:21:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40861 and previous config saved to /var/cache/conftool/dbconfig/20221123-222105-ladsgroup.json [22:21:12] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:22:17] (03PS1) 
10Brennen Bearnes: sudo: add update-mediawiki-tools release to deployers [puppet] - 10https://gerrit.wikimedia.org/r/860121 (https://phabricator.wikimedia.org/T323735) [22:25:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [22:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:26:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [22:26:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:26:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:26:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40862 and previous config saved to /var/cache/conftool/dbconfig/20221123-222627-ladsgroup.json [22:26:33] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:29:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40864 and previous config saved to /var/cache/conftool/dbconfig/20221123-222903-marostegui.json [22:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:32:30] Well damn, I hope I didn't just break 
deployment-prep; I hope nothing relied on the sessionstore instance there [22:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P40865 and previous config saved to /var/cache/conftool/dbconfig/20221123-223611-ladsgroup.json [22:40:52] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2050.codfw.wmnet with OS bullseye [22:40:59] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [22:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40866 and previous config saved to /var/cache/conftool/dbconfig/20221123-224409-marostegui.json [22:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P40868 and previous config saved to /var/cache/conftool/dbconfig/20221123-225118-ladsgroup.json [22:59:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40869 and previous config saved to /var/cache/conftool/dbconfig/20221123-225916-marostegui.json [22:59:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:59:23] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:59:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:59:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40870 and previous config saved to /var/cache/conftool/dbconfig/20221123-225937-marostegui.json [23:02:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40871 and previous config saved to /var/cache/conftool/dbconfig/20221123-230209-marostegui.json [23:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40872 and previous config saved to /var/cache/conftool/dbconfig/20221123-230624-ladsgroup.json [23:06:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [23:06:32] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:06:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [23:17:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40874 and previous config saved to /var/cache/conftool/dbconfig/20221123-231716-marostegui.json [23:20:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10andrea.denisse) [23:22:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10andrea.denisse) The request checklist for access is completed. I think we can merge patch [[ https://gerrit.wikimedia.org/r/c/854952 | #854952 ]]... 
[23:23:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10andrea.denisse) [23:23:59] (03CR) 10Andrea Denisse: [C: 03+1] "Approving because the access checklist is completed." [puppet] - 10https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: 10Filippo Giunchedi) [23:25:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10andrea.denisse) [23:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40875 and previous config saved to /var/cache/conftool/dbconfig/20221123-233222-marostegui.json [23:32:55] (03PS1) 10Ebernhardson: mjolnir msearch: Reduce allowed concurrency [puppet] - 10https://gerrit.wikimedia.org/r/860129 (https://phabricator.wikimedia.org/T318575) [23:47:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40876 and previous config saved to /var/cache/conftool/dbconfig/20221123-234729-marostegui.json [23:47:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:47:36] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:47:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:47:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:48:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:48:06] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depooling db2159 (T321126)', diff saved to https://phabricator.wikimedia.org/P40877 and previous config saved to /var/cache/conftool/dbconfig/20221123-234806-marostegui.json [23:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321126)', diff saved to https://phabricator.wikimedia.org/P40878 and previous config saved to /var/cache/conftool/dbconfig/20221123-235037-marostegui.json [23:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40879 and previous config saved to /var/cache/conftool/dbconfig/20221123-235928-ladsgroup.json [23:59:35] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214