[00:04:02] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:11:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P40699 and previous config saved to /var/cache/conftool/dbconfig/20221123-001147-marostegui.json
[00:12:54] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:14:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1004.eqiad.wmnet with OS bullseye
[00:14:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye completed: - dbprov1004 (**WARN**)...
[00:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:26:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40700 and previous config saved to /var/cache/conftool/dbconfig/20221123-002654-marostegui.json
[00:26:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance
[00:27:01] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[00:27:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance
[00:27:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321130)', diff saved to https://phabricator.wikimedia.org/P40701 and previous config saved to /var/cache/conftool/dbconfig/20221123-002716-marostegui.json
[00:39:07] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:40:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321130)', diff saved to https://phabricator.wikimedia.org/P40702 and previous config saved to /var/cache/conftool/dbconfig/20221123-004005-marostegui.json
[00:40:11] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[00:44:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:45:37] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:55:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P40703 and previous config saved to /var/cache/conftool/dbconfig/20221123-005511-marostegui.json
[00:59:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[00:59:46] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2041.codfw.wmnet with OS bullseye
[00:59:47] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye
[00:59:55] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**)   - Removed from Pu...
[01:00:55] <sukhe>	 !log sudo rm /etc/dhcp/automation/ttyS1-115200/cp2041.conf
[01:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[01:01:22] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye
[01:04:57] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[01:10:15] <wikibugs>	 (PS3) Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142)
[01:10:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P40704 and previous config saved to /var/cache/conftool/dbconfig/20221123-011018-marostegui.json
[01:10:25] <wikibugs>	 (CR) Krinkle: [C: +2] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle)
[01:11:06] <wikibugs>	 (Merged) jenkins-bot: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle)
[01:11:31] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye
[01:11:38] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**)   - Removed from Pu...
[01:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:15:57] <icinga-wm>	 PROBLEM - Host cp2042 is DOWN: PING CRITICAL - Packet loss = 100%
[01:16:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[01:20:55] <icinga-wm>	 RECOVERY - Host cp2042 is UP: PING WARNING - Packet loss = 75%, RTA = 33.20 ms
[01:25:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321130)', diff saved to https://phabricator.wikimedia.org/P40705 and previous config saved to /var/cache/conftool/dbconfig/20221123-012524-marostegui.json
[01:25:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance
[01:25:31] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[01:25:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance
[01:29:25] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye
[01:29:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[01:36:06] <wikibugs>	 (PS2) Krinkle: admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog)
[01:36:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance
[01:36:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance
[01:36:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40706 and previous config saved to /var/cache/conftool/dbconfig/20221123-013627-marostegui.json
[01:36:33] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:59] <wikibugs>	 (CR) Ssingh: [C: +2] admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog)
[01:43:04] <wikibugs>	 (CR) Ssingh: [V: +2 C: +2] admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog)
[01:43:31] <wikibugs>	 (PS3) Ssingh: admin: Update phedenskogs ssh keys [puppet] - https://gerrit.wikimedia.org/r/857529 (owner: Phedenskog)
[01:43:52] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye
[01:49:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40707 and previous config saved to /var/cache/conftool/dbconfig/20221123-014912-marostegui.json
[01:49:18] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[01:56:25] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:00:31] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:04:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P40708 and previous config saved to /var/cache/conftool/dbconfig/20221123-020418-marostegui.json
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:14:32] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage
[02:15:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[02:15:44] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:15] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage
[02:19:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P40709 and previous config saved to /var/cache/conftool/dbconfig/20221123-021925-marostegui.json
[02:19:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[02:21:33] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul)
[02:27:04] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (andrea.denisse)
[02:27:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2041']
[02:27:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul) Open→Resolved @jcrespo this is done
[02:27:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[02:30:05] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:30:13] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[02:31:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:34:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321130)', diff saved to https://phabricator.wikimedia.org/P40710 and previous config saved to /var/cache/conftool/dbconfig/20221123-023431-marostegui.json
[02:34:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance
[02:34:38] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[02:34:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance
[02:34:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40711 and previous config saved to /var/cache/conftool/dbconfig/20221123-023453-marostegui.json
[02:42:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2041.codfw.wmnet with OS bullseye
[02:47:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40712 and previous config saved to /var/cache/conftool/dbconfig/20221123-024751-marostegui.json
[02:47:57] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[02:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:57:51] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P40713 and previous config saved to /var/cache/conftool/dbconfig/20221123-030257-marostegui.json
[03:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:15:13] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:18:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P40714 and previous config saved to /var/cache/conftool/dbconfig/20221123-031804-marostegui.json
[03:19:13] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:09] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:29:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:33:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321130)', diff saved to https://phabricator.wikimedia.org/P40715 and previous config saved to /var/cache/conftool/dbconfig/20221123-033310-marostegui.json
[03:33:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance
[03:33:17] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[03:33:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance
[03:33:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40716 and previous config saved to /var/cache/conftool/dbconfig/20221123-033332-marostegui.json
[03:45:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40717 and previous config saved to /var/cache/conftool/dbconfig/20221123-034554-marostegui.json
[03:46:00] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[03:51:41] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[04:01:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P40718 and previous config saved to /var/cache/conftool/dbconfig/20221123-040100-marostegui.json
[04:07:01] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:09:03] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:16:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P40719 and previous config saved to /var/cache/conftool/dbconfig/20221123-041607-marostegui.json
[04:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:21:59] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[04:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:28:25] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:29:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:30:19] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321130)', diff saved to https://phabricator.wikimedia.org/P40720 and previous config saved to /var/cache/conftool/dbconfig/20221123-043114-marostegui.json
[04:31:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:31:20] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[04:31:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:31:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40721 and previous config saved to /var/cache/conftool/dbconfig/20221123-043135-marostegui.json
[04:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:44:31] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40722 and previous config saved to /var/cache/conftool/dbconfig/20221123-044523-marostegui.json
[04:45:29] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[04:48:25] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:52:19] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[04:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:00:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P40723 and previous config saved to /var/cache/conftool/dbconfig/20221123-050029-marostegui.json
[05:08:33] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:15:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P40724 and previous config saved to /var/cache/conftool/dbconfig/20221123-051536-marostegui.json
[05:18:43] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:19:01] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:47] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:29:11] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40725 and previous config saved to /var/cache/conftool/dbconfig/20221123-053043-marostegui.json
[05:30:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance
[05:30:50] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[05:30:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance
[05:31:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40726 and previous config saved to /var/cache/conftool/dbconfig/20221123-053104-marostegui.json
[05:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:38:47] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:39:03] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:43:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40727 and previous config saved to /var/cache/conftool/dbconfig/20221123-054345-marostegui.json
[05:43:52] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[05:44:13] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.22:9042 on aqs1018 is OK: TCP OK - 0.000 second response time on 10.64.32.22 port 9042 https://phabricator.wikimedia.org/T93886
[05:44:57] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:53:19] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[05:57:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5309726728 and 57442 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:57:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 640697488 and 287 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:57:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11642464760 and 57468 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:58:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add conversion for ingress [deployment-charts] - https://gerrit.wikimedia.org/r/859567 (owner: Giuseppe Lavagetto)
[05:58:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P40728 and previous config saved to /var/cache/conftool/dbconfig/20221123-055852-marostegui.json
[05:59:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 540688 and 95 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:59:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:08] <wikibugs>	 (PS5) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016
[06:01:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 507952 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:02:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:02:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:02:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321126)', diff saved to https://phabricator.wikimedia.org/P40729 and previous config saved to /var/cache/conftool/dbconfig/20221123-060228-marostegui.json
[06:02:34] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:04:41] <icinga-wm>	 PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:05:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321126)', diff saved to https://phabricator.wikimedia.org/P40730 and previous config saved to /var/cache/conftool/dbconfig/20221123-060500-marostegui.json
[06:07:48] <wikibugs>	 (PS6) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016
[06:09:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P40731 and previous config saved to /var/cache/conftool/dbconfig/20221123-060956-root.json
[06:10:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance
[06:10:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance
[06:11:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[06:12:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[06:12:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P40732 and previous config saved to /var/cache/conftool/dbconfig/20221123-061226-root.json
[06:13:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (Joe) @wiki_willy @Jclark-ctr even if the task is stalled, just to make sure: these servers are still in rotation, Please do not decommission them until we've removed them. We need to resolve...
[06:13:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P40733 and previous config saved to /var/cache/conftool/dbconfig/20221123-061358-marostegui.json
[06:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:27:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P40734 and previous config saved to /var/cache/conftool/dbconfig/20221123-062731-root.json
[06:29:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40735 and previous config saved to /var/cache/conftool/dbconfig/20221123-062905-marostegui.json
[06:29:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance
[06:29:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance
[06:29:11] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[06:39:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance
[06:39:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance
[06:39:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40736 and previous config saved to /var/cache/conftool/dbconfig/20221123-063932-marostegui.json
[06:39:38] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[06:42:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P40737 and previous config saved to /var/cache/conftool/dbconfig/20221123-064236-root.json
[06:51:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40738 and previous config saved to /var/cache/conftool/dbconfig/20221123-065153-marostegui.json
[06:51:59] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[06:57:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P40739 and previous config saved to /var/cache/conftool/dbconfig/20221123-065741-root.json
[06:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:04:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:05:41] <icinga-wm>	 RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:07:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P40740 and previous config saved to /var/cache/conftool/dbconfig/20221123-070659-marostegui.json
[07:07:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:12:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P40741 and previous config saved to /var/cache/conftool/dbconfig/20221123-071246-root.json
[07:20:03] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 28307706224 and 1439 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:20:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45376667648 and 1451 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:20:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49965269256 and 1481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:20:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25276122560 and 1481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:20:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31310753096 and 1482 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:22:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P40742 and previous config saved to /var/cache/conftool/dbconfig/20221123-072208-marostegui.json
[07:37:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321130)', diff saved to https://phabricator.wikimedia.org/P40743 and previous config saved to /var/cache/conftool/dbconfig/20221123-073714-marostegui.json
[07:37:21] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[07:37:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance
[07:37:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance
[07:40:19] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:50] <wikibugs>	 (CR) ArielGlenn: "Still a pcc failure https://puppet-compiler.wmflabs.org/output/852260/38398/clouddumps1001.wikimedia.org/change.clouddumps1001.wikimedia.o" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn)
[07:44:12] <wikibugs>	 SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) I have left mysql stopped so @Papaul can do the test whenever he wants.
[07:48:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:48:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:57:05] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T0800)
[08:00:05] <jouncebot>	 kart_ and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] <wikibugs>	 (PS2) KartikMistry: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415)
[08:00:18] * kart_ is here
[08:00:22] <_joe_>	 o/
[08:00:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1027.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[08:00:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1027.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[08:01:05] <kart_>	 _joe_: go ahead with your patch, while I just rebased my config patch.. still on CI
[08:01:33] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327)
[08:01:44] <wikibugs>	 (PS1) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327)
[08:01:46] <wikibugs>	 (PS1) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162)
[08:01:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162)
[08:01:59] <_joe_>	 kart_: tbh, I was waiting for a deployer to be around
[08:02:19] <_joe_>	 I needed a +1 on the patch at least
[08:02:21] <wikibugs>	 (CR) CI reject: [V: -1] site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: Giuseppe Lavagetto)
[08:02:25] <_joe_>	 so I will just wait
[08:02:29] <kart_>	 OK. I'll go with my patch first and see if anyone around.
[08:03:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) (owner: KartikMistry)
[08:03:58] <wikibugs>	 (CR) CI reject: [V: -1] site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: Giuseppe Lavagetto)
[08:04:14] <wikibugs>	 (Merged) jenkins-bot: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) (owner: KartikMistry)
[08:04:32] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]]
[08:04:32] <wikibugs>	 (PS2) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327)
[08:04:34] <wikibugs>	 (PS2) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327)
[08:04:36] <wikibugs>	 (PS2) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162)
[08:04:38] <stashbot>	 T323415: Make Western Frisian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T323415
[08:04:38] <wikibugs>	 (PS2) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162)
[08:04:55] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:05:15] <wikibugs>	 (CR) CI reject: [V: -1] site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: Giuseppe Lavagetto)
[08:06:41] <wikibugs>	 (CR) CI reject: [V: -1] site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: Giuseppe Lavagetto)
[08:07:21] <_joe_>	 urbanecm, Amir1: around?
[08:09:55] <wikibugs>	 (PS3) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327)
[08:09:57] <wikibugs>	 (PS3) Giuseppe Lavagetto: conftool: add the new servers [puppet] - https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327)
[08:09:59] <wikibugs>	 (PS3) Giuseppe Lavagetto: conftool: remove old mw servers [puppet] - https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162)
[08:10:01] <wikibugs>	 (PS3) Giuseppe Lavagetto: site: remove old appservers [puppet] - https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162)
[08:12:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS bullseye
[08:12:11] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye
[08:13:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:14:29] <wikibugs>	 (CR) JMeybohm: "Nice! Do we want this right now? In that case we will have to do a backport release of the 0.1.x version of this chart." [deployment-charts] - https://gerrit.wikimedia.org/r/859586 (owner: Alexandros Kosiaris)
[08:14:32] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:859161|Make Western Frisian Wikipedia Machine Translation stricter by 10% (T323415)]] (duration: 10m 00s)
[08:14:38] <stashbot>	 T323415: Make Western Frisian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T323415
[08:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:16:36] <wikibugs>	 (CR) JMeybohm: [C: +2] pontoon: Add .crt filename suffix to PKI root CA [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: JMeybohm)
[08:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:25:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage
[08:26:06] <wikibugs>	 (PS1) Marostegui: db1133: Move it to test-s4 section [puppet] - https://gerrit.wikimedia.org/r/859972 (https://phabricator.wikimedia.org/T322993)
[08:27:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1027.eqiad.wmnet with reason: host reimage
[08:28:33] <wikibugs>	 SRE, Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (SLyngshede-WMF) We'll attempt to build using RQ and the Django RQ module. RQ supports basic job queuing, as well as job dependencies and the ability to get job status.  Other than supp...
[08:30:29] <wikibugs>	 SRE, Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (SLyngshede-WMF) Basic proof-of-concept for queuing have been done of simple queues.  Remaining is the job dependency and job status. These are supported by RQ directly, but it's unclea...
[08:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:42:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1027.eqiad.wmnet with OS bullseye
[08:42:11] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye completed: - ganeti1027 (**PASS**)   - Downtimed on...
[08:53:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:57:03] <wikibugs>	 (PS5) Jbond: install_server: Add dynamic raid configuration [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[08:59:07] <wikibugs>	 (PS2) Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013)
[09:04:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:06:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[09:06:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:06:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add SPDX headers to various IF profiles [puppet] - https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:06:40] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:56] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:34] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:44] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:12:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[09:14:43] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl1001 as attempt to mitigate weird LIST latencies 
[09:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:15:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:16:32] <Emperor>	 !log set thanos ring replicas to 3.10 T311690
[09:16:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:38] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[09:18:10] <wikibugs>	 (PS4) Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621)
[09:19:29] <elukey>	 !log restart kube-apiserver on ml-staging-ctrl2001 as attempt to mitigate weird LIST latencies 
[09:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:20:49] <elukey>	 cc: klausman: --^ (both ml-serve-eqiad and staging :)
[09:21:06] <klausman>	 darn
[09:21:08] <wikibugs>	 (PS3) Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981)
[09:21:51] <wikibugs>	 (CR) Vgutierrez: [C: +1] Transferer: Enable PBKDF2 usage with 310000 iterations (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: Jcrespo)
[09:23:13] <wikibugs>	 (CR) David Caro: "Just one comment, looks good" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: Raymond Ndibe)
[09:23:51] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: Update docker images to use single model server [deployment-charts] - https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: Ilias Sarantopoulos)
[09:24:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:24:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST validatingwebhookconfigurations) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:25:10] <wikibugs>	 (PS1) Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621)
[09:25:14] <wikibugs>	 (PS1) Muehlenhoff: Add udevd to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991)
[09:28:35] <wikibugs>	 (CR) Elukey: [C: +2] team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:29:40] <wikibugs>	 (CR) Stevemunene: [C: +2] Allow introspection for production environment [puppet] - https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: Stevemunene)
[09:32:48] <wikibugs>	 (CR) MVernon: swift: move ms-be2050 to new naming schema (2 comments) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[09:33:27] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided)
[09:33:43] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 00m 15s)
[09:35:05] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] GrowthExperiments: Allow accessing NewImpact module in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[09:36:52] <wikibugs>	 (03PS1) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621)
[09:37:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32 and 7337 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:38:07] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621)
[09:38:41] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] swift: move ms-be2050 to new naming schema (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[09:41:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:42:19] <wikibugs>	 (03PS1) 10Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621)
[09:42:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:42:34] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided)
[09:42:39] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 00m 05s)
[09:43:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add udevd to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:45:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for virtlogd [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991)
[09:46:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:49:10] <wikibugs>	 (03PS2) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621)
[09:49:12] <wikibugs>	 (03PS2) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621)
[09:49:14] <wikibugs>	 (03PS2) 10Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621)
[09:50:19] <wikibugs>	 (03CR) 10Raymond Ndibe: "This change is ready for review." (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[09:51:33] <wikibugs>	 (03PS3) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645)
[09:52:23] <icinga-wm>	 PROBLEM - swift eqiad object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad
[09:55:18] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38400/console" [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[09:55:20] <wikibugs>	 (03CR) 10MVernon: swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[09:57:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:59:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[10:00:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet
[10:01:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1047: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184)
[10:02:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:03:35] <wikibugs>	 (03PS1) 10Vgutierrez: archiva: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720)
[10:05:56] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "final sync before merging 804575 - jbond@cumin2002"
[10:06:15] <wikibugs>	 (03PS1) 10Vgutierrez: gerrit: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720)
[10:07:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1998912 and 1650 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:08:15] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "final sync before merging 804575 - jbond@cumin2002"
[10:08:25] <wikibugs>	 (03PS6) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575
[10:08:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond)
[10:09:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond)
[10:10:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add udevd to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/859975 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:10:35] <wikibugs>	 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff)
[10:10:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: felix: Instruct felix to set the src parameter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586 (owner: 10Alexandros Kosiaris)
[10:10:53] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:11:13] <wikibugs>	 (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[10:11:19] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet
[10:11:59] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:04] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38401/console" [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[10:13:05] <wikibugs>	 (03Merged) 10jenkins-bot: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond)
[10:13:16] <wikibugs>	 (03CR) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[10:13:23] <wikibugs>	 (03CR) 10Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[10:14:29] <wikibugs>	 (03PS4) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645)
[10:14:53] <wikibugs>	 (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[10:14:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[10:15:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:15:47] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:11] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[10:16:15] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:17] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[10:16:23] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: O:prometheus:  use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[10:16:29] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[10:16:40] <claime>	 httpbb_hourly_appserver.service < handled
[10:18:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 393280 and 2299 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:18:09] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:18:21] <vgutierrez>	 moritzm: keyholder seems to be unhappy in cumin1001
[10:18:56] <vgutierrez>	 oh I just saw -sre, sorry
[10:20:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[10:20:47] <wikibugs>	 (03CR) 10Jcrespo: "Deploying as is- this doesn't fix all issues, but it is not a bad thing to merge." [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo)
[10:20:51] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo)
[10:20:58] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Transferer: Enable PBKDF2 usage with 310000 iterations [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo)
[10:21:05] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Update changelog for release 1.1 [software/transferpy] - 10https://gerrit.wikimedia.org/r/859446 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo)
[10:21:38] <wikibugs>	 (03CR) 10Jcrespo: ""man transferpy" FYI" [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 (owner: 10Jcrespo)
[10:21:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Add man page for transfer.py executable [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 (owner: 10Jcrespo)
[10:23:09] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10Volans)
[10:23:17] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10Volans) a:05Volans→03None
[10:27:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[10:31:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 40 and 3085 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:31:43] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "I'm a bit late to the party, but thanks elukey. Looks good." [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey)
[10:33:49] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1015832 and 3241 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:37:23] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859451 (owner: 10Arturo Borrero Gonzalez)
[10:37:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:37:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:38:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40745 and previous config saved to /var/cache/conftool/dbconfig/20221123-103805-marostegui.json
[10:38:06] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro)
[10:38:11] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[10:38:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859451 (owner: 10Arturo Borrero Gonzalez)
[10:39:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1133: Move it to test-s4 section [puppet] - 10https://gerrit.wikimedia.org/r/859972 (https://phabricator.wikimedia.org/T322993) (owner: 10Marostegui)
[10:40:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40746 and previous config saved to /var/cache/conftool/dbconfig/20221123-104023-marostegui.json
[10:42:30] <wikibugs>	 (03PS1) 10Kosta Harlan: [WIP] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686)
[10:42:55] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:42:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:24] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:46:41] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:47:47] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:48:17] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:48:17] <wikibugs>	 (03PS1) 10Jbond: swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677)
[10:49:05] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:50:48] <wikibugs>	 (03PS2) 10Jbond: swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677)
[10:51:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38403/console" [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[10:51:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:52:28] <elukey>	 this is due to a deployment --^
[10:52:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] swift: base the object number on the scsi path [puppet] - 10https://gerrit.wikimedia.org/r/859992 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[10:54:05] <wikibugs>	 (03PS2) 10Kosta Harlan: [WIP] GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686)
[10:54:07] <wikibugs>	 (03PS1) 10Kosta Harlan: GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859995
[10:55:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40747 and previous config saved to /var/cache/conftool/dbconfig/20221123-105529-marostegui.json
[10:56:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:57:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "To check the reassigning logic, please see" [puppet] - 10https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto)
[10:57:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[10:58:06] <wikibugs>	 (03PS2) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677)
[10:59:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM, can we deploy the releases right away or do we need to wait on the hosts actually being in production?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[11:01:51] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) p:05Triage→03Medium
[11:01:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1047: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[11:05:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[11:06:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1047: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859982 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[11:06:52] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[11:07:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with O...
[11:09:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: Add mw-web, mw-api-ext [dns] - 10https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:09:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:10:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40748 and previous config saved to /var/cache/conftool/dbconfig/20221123-111036-marostegui.json
[11:10:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:48] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Test - volans@cumin1001"
[11:13:05] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (10Vgutierrez) @legoktm it looks like the easiest approach would be adding lists1001 as a backend server on ATS and set the caching policy to `pass`. Under this scenario, lists.wikimed...
[11:13:23] <wikibugs>	 (03PS3) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677)
[11:13:41] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10akosiaris) Hey everyone,  I think this discussion would benefit greatly from a higher bandwidth venue than phabricator. It's quite clear there are pain points regarding t...
[11:14:13] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Test - volans@cumin1001"
[11:14:46] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wmnet: Add mw-web, mw-api-ext [dns] - 10https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:15:03] <claime>	 !log Adding mw-web and mw-api-ext to wmnet dns
[11:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:22] <wikibugs>	 (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[11:16:40] <claime>	 Hold up, that change isn't actually good, fixing.
[11:16:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is mediawiki, so we need a bit more refinement around discovery records." [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:16:53] <wikibugs>	 (03PS1) 10Cathal Mooney: Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635)
[11:17:00] <wikibugs>	 (03CR) 10David Caro: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[11:18:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::catalog: Add mw-web and mw-api-ext (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:18:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::catalog: Add mw-web and mw-api-ext (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:19:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:19:17] <wikibugs>	 (03PS2) 10Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - 10https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621)
[11:19:19] <wikibugs>	 (03PS1) 10Clément Goubert: wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621)
[11:19:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[11:19:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::catalog: mw-web and mw-api-ext to production [puppet] - 10https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:20:13] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[11:20:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:20:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[11:21:00] <wikibugs>	 (03Merged) 10jenkins-bot: Move if statement around 'ospf' section in asw template [homer/public] - 10https://gerrit.wikimedia.org/r/860003 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[11:21:48] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wmnet: Fix mw-web, mw-api-ext codfw [dns] - 10https://gerrit.wikimedia.org/r/860004 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert)
[11:22:50] <claime>	 !log authdns-update for mw-web and mw-api-ext
[11:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:04] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[11:23:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, looking a bit into it, will it use this feature? (from https://libvirt.org/manpages/virtlogd.html)" [puppet] - 10https://gerrit.wikimedia.org/r/859980 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:24:38] <topranks>	 !log changing port-speed configuration syntax on asw1-b12-drmrs
[11:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40750 and previous config saved to /var/cache/conftool/dbconfig/20221123-112542-marostegui.json
[11:25:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:25:48] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[11:25:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:26:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40751 and previous config saved to /var/cache/conftool/dbconfig/20221123-112604-marostegui.json
[11:26:56] <wikibugs>	 (CR) Jelto: "This change is ready for review." [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[11:28:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40752 and previous config saved to /var/cache/conftool/dbconfig/20221123-112821-marostegui.json
[11:28:41] <wikibugs>	 (CR) Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:30:13] <wikibugs>	 (CR) Sergio Gimeno: [C: +1] GrowthExperiments: Remove unused GEHomepageNewAccountVariants config [mediawiki-config] - https://gerrit.wikimedia.org/r/859995 (owner: Kosta Harlan)
[11:30:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 14379970856 and 1048 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:33:51] <wikibugs>	 (CR) Majavah: [C: -1] webservice cli: allow for deployment of custom harbor images (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:34:38] <wikibugs>	 (CR) David Caro: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:36:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet
[11:36:19] <icinga-wm>	 PROBLEM - Check systemd state on cp5020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:19] <wikibugs>	 (CR) Majavah: [C: -1] webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:37:37] <wikibugs>	 (PS1) Jbond: puppetdb: add cpu_flags back to puppetdb [puppet] - https://gerrit.wikimedia.org/r/860006
[11:38:56] <wikibugs>	 (CR) Jbond: [C: +2] puppetdb: add cpu_flags back to puppetdb [puppet] - https://gerrit.wikimedia.org/r/860006 (owner: Jbond)
[11:39:17] <wikibugs>	 (PS5) Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621)
[11:39:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet
[11:40:12] <wikibugs>	 (CR) Clément Goubert: service::catalog: Add mw-web and mw-api-ext (5 comments) [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:42:22] <Lucas_WMDE>	 jouncebot: nowandnext
[11:42:22] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 17 minute(s)
[11:42:23] <jouncebot>	 In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1400)
[11:42:36] <vgutierrez>	 jbond: ^^ are you playing with cp5020?
[11:42:37] <wikibugs>	 (CR) David Caro: webservice cli: allow for deployment of custom harbor images (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:42:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet
[11:42:44] <Lucas_WMDE>	 I’ll deploy a security patch if nobody objects
[11:43:24] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38404/console" [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:43:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40753 and previous config saved to /var/cache/conftool/dbconfig/20221123-114327-marostegui.json
[11:44:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:44:09] <vgutierrez>	 https://www.irccloud.com/pastebin/vT1kDXHp/
[11:44:17] <wikibugs>	 (CR) Majavah: [C: -1] webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:45:10] <wikibugs>	 (CR) David Caro: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:45:12] <vgutierrez>	 jbond: ^^ random error?
[11:45:17] <wikibugs>	 (PS3) Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621)
[11:46:04] <wikibugs>	 (CR) David Caro: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[11:46:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet
[11:46:45] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0236 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:49:17] <jbond>	 this was me, it should clear soon
[11:50:37] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:50:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:51:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Active/passive records have a different type." [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:51:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36124399688 and 1634 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:52:19] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0004914 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:52:42] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[11:52:54] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu...
[11:53:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu...
[11:55:10] <moritzm>	 !log updating mw canaries to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358
[11:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:25] <wikibugs>	 (PS4) Clément Goubert: mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621)
[11:56:55] <wikibugs>	 (PS10) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933
[11:57:21] <wikibugs>	 (CR) CI reject: [V: -1] mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:57:23] <wikibugs>	 (CR) Clément Goubert: mw-web, mw-api-ext: add discovery records (4 comments) [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[11:57:27] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[11:57:54] <wikibugs>	 (CR) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[11:58:09] <Lucas_WMDE>	 I’m deploying my patch now ftr
[11:58:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40754 and previous config saved to /var/cache/conftool/dbconfig/20221123-115834-marostegui.json
[12:01:20] <wikibugs>	 (CR) David Caro: [C: +2] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[12:01:43] <logmsgbot>	 !log lucaswerkmeister-wmde: Deployed security patch for T323592
[12:02:18] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add conversion for ingress [deployment-charts] - https://gerrit.wikimedia.org/r/859567
[12:03:32] * Lucas_WMDE done
[12:04:27] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1046: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184)
[12:04:34] <wikibugs>	 (PS3) Clément Goubert: service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621)
[12:04:36] <wikibugs>	 (PS3) Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621)
[12:04:38] <wikibugs>	 (PS3) Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621)
[12:04:42] <wikibugs>	 (Merged) jenkins-bot: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[12:05:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[12:05:50] <wikibugs>	 (CR) Clément Goubert: "recheck" [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[12:06:12] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38405/console" [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[12:06:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[12:06:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[12:07:17] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38406/console" [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[12:07:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[12:09:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13384325296 and 1123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:09:37] <wikibugs>	 (PS16) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[12:10:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[12:10:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:11:00] <wikibugs>	 (CR) Arturo Borrero Gonzalez: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[12:13:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40755 and previous config saved to /var/cache/conftool/dbconfig/20221123-121340-marostegui.json
[12:13:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:13:47] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[12:13:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:14:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40756 and previous config saved to /var/cache/conftool/dbconfig/20221123-121402-marostegui.json
[12:16:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40758 and previous config saved to /var/cache/conftool/dbconfig/20221123-121618-marostegui.json
[12:17:45] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::auto_restarts::service for Turnilo [puppet] - https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991)
[12:18:14] <wikibugs>	 (CR) Hnowlan: [C: +1] "LGTM. Context from bpirkle:" [deployment-charts] - https://gerrit.wikimedia.org/r/859541 (owner: Giuseppe Lavagetto)
[12:18:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[12:18:30] <wikibugs>	 (CR) Cathal Mooney: [C: +1] cloudvirt1046: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[12:18:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1046: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/860010 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[12:19:08] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mnz-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:27] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bullseye
[12:19:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1046.eqiad.wmnet with O...
[12:20:50] <icinga-wm>	 RECOVERY - Check systemd state on cp5020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38631679832 and 2361 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2904 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:24:04] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:25:14] <wikibugs>	 (CR) Jbond: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[12:26:06] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/858659 (https://phabricator.wikimedia.org/T306200) (owner: Andrew Bogott)
[12:26:39] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] service::catalog: mw-web and mw-api-ext to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/859974 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[12:28:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 310 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:28:13] <wikibugs>	 (PS4) David Caro: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773612 (owner: Arturo Borrero Gonzalez)
[12:30:20] <wikibugs>	 (Abandoned) Jbond: puppet_compiler: drop yaml dir from export facts tar ball [puppet] - https://gerrit.wikimedia.org/r/745990 (owner: Jbond)
[12:31:20] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773612 (owner: Arturo Borrero Gonzalez)
[12:31:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40759 and previous config saved to /var/cache/conftool/dbconfig/20221123-123125-marostegui.json
[12:31:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 525 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:32:15] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621)
[12:32:20] <stashbot>	 T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621
[12:32:37] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage
[12:32:42] <claime>	 !log restarting pybal on lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet for mw-web and mw-api-ext behind LVS T323621
[12:32:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:23] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621)
[12:34:20] <wikibugs>	 (CR) Filippo Giunchedi: Add new graphite hosts (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: Filippo Giunchedi)
[12:36:08] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage
[12:36:45] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[12:39:43] <wikibugs>	 (PS1) Clément Goubert: Fix mw-api-ext eqiad service ip [puppet] - https://gerrit.wikimedia.org/r/860015
[12:40:55] <wikibugs>	 (PS2) Clément Goubert: service::catalog: Fix mw-api-ext eqiad service ip [puppet] - https://gerrit.wikimedia.org/r/860015
[12:43:13] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet
[12:44:26] <wikibugs>	 (CR) Clément Goubert: [C: +2] service::catalog: Fix mw-api-ext eqiad service ip [puppet] - https://gerrit.wikimedia.org/r/860015 (owner: Clément Goubert)
[12:45:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:46:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40760 and previous config saved to /var/cache/conftool/dbconfig/20221123-124631-marostegui.json
[12:48:09] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[12:49:08] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621)
[12:49:13] <stashbot>	 T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621
[12:50:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:50:16] <wikibugs>	 (CR) Muehlenhoff: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[12:50:34] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[12:52:18] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs (T323621)
[12:54:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:55:58] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs (T323621)
[12:56:04] <stashbot>	 T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621
[12:56:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:58:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[12:58:34] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs (T323621)
[12:59:21] <wikibugs>	 (PS1) Muehlenhoff: Add logind to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991)
[12:59:33] <wikibugs>	 (PS1) Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - https://gerrit.wikimedia.org/r/860019
[13:01:34] <wikibugs>	 (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[13:01:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321126)', diff saved to https://phabricator.wikimedia.org/P40761 and previous config saved to /var/cache/conftool/dbconfig/20221123-130138-marostegui.json
[13:01:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[13:01:45] <wikibugs>	 (CR) CI reject: [V: -1] spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[13:01:45] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[13:01:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[13:02:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40762 and previous config saved to /var/cache/conftool/dbconfig/20221123-130159-marostegui.json
[13:02:23] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS bullseye
[13:02:33] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bu...
[13:02:41] <wikibugs>	 (PS4) Clément Goubert: service::catalog: mw-web and mw-api-ext to production [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621)
[13:02:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:07:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:10:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:52] <wikibugs>	 (PS1) Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[13:17:34] <wikibugs>	 (PS2) Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - https://gerrit.wikimedia.org/r/860019
[13:18:28] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:18:48] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38407/console" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[13:18:57] <wikibugs>	 (PS1) Jaime Nuche: scap.cfg: enable image building in production cluster [puppet] - https://gerrit.wikimedia.org/r/860023
[13:19:02] <wikibugs>	 (PS2) Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[13:20:34] <wikibugs>	 (PS3) Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[13:23:54] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[13:24:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:25:16] <moritzm>	 !log installing apache security updates on mw canaries
[13:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[13:27:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:27:56] <wikibugs>	 (CR) Slyngshede: "Would like feedback on general direction or obvious oversights." [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[13:31:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[13:32:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:35:00] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1045: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184)
[13:39:16] <moritzm>	 !log updating mw canaries to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358
[13:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:46:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:35] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[13:52:38] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[13:52:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1045: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/860047 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[13:53:41] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bullseye
[13:53:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1045.eqiad.wmnet with O...
[13:54:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[13:56:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[13:57:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[13:58:27] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:11] <Lucas_WMDE>	 o/
[14:00:18] <Lucas_WMDE>	 yup, nothing in the calendar
[14:01:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[14:02:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40763 and previous config saved to /var/cache/conftool/dbconfig/20221123-140215-marostegui.json
[14:02:22] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[14:02:37] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[14:02:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:05:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:06:32] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:48] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[14:06:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[14:07:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[14:07:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[14:07:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40764 and previous config saved to /var/cache/conftool/dbconfig/20221123-140712-ladsgroup.json
[14:07:22] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:07:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[14:07:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40765 and previous config saved to /var/cache/conftool/dbconfig/20221123-140732-ladsgroup.json
[14:07:39] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[14:08:22] <wikibugs>	 (CR) Clément Goubert: [C: +2] service::catalog: mw-web and mw-api-ext to production [puppet] - https://gerrit.wikimedia.org/r/859977 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[14:08:47] <wikibugs>	 (CR) Volans: "The only concern I have with this approach is that the other SREs seeing the systemd alert would not know what to do and why it's alerting" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond)
[14:10:12] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[14:12:11] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:20] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[14:14:01] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro
[14:14:21] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro
[14:14:52] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mw-web
[14:15:12] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mw-api-ext
[14:15:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:15:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40766 and previous config saved to /var/cache/conftool/dbconfig/20221123-141543-ladsgroup.json
[14:15:45] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-web, mw-api-ext: add discovery records [dns] - https://gerrit.wikimedia.org/r/859978 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[14:15:47] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:15:48] <wikibugs>	 (CR) CI reject: [V: -1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez)
[14:17:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40767 and previous config saved to /var/cache/conftool/dbconfig/20221123-141722-marostegui.json
[14:18:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] scap.cfg: enable image building in production cluster [puppet] - https://gerrit.wikimedia.org/r/860023 (owner: Jaime Nuche)
[14:19:22] <wikibugs>	 (PS4) Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621)
[14:19:59] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops: httpbb random read timeout on cumin2002 - https://phabricator.wikimedia.org/T323707 (Volans) p:Triage→Medium
[14:20:42] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! Added also Ben and Steve so we can get the green light from the DE folks as well." [puppet] - https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[14:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:22:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40768 and previous config saved to /var/cache/conftool/dbconfig/20221123-142159-ladsgroup.json
[14:22:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] remove unused chart/project image-suggestion-api [deployment-charts] - https://gerrit.wikimedia.org/r/859541 (owner: Giuseppe Lavagetto)
[14:23:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add logind to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/860018 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:24:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:18] <wikibugs>	 (PS5) Clément Goubert: Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621)
[14:24:41] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:25:05] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:26:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:26:33] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:26:45] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38409/console" [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[14:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:27:11] <wikibugs>	 (Merged) jenkins-bot: remove unused chart/project image-suggestion-api [deployment-charts] - https://gerrit.wikimedia.org/r/859541 (owner: Giuseppe Lavagetto)
[14:28:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "I hate this script so, so much 😄" [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[14:28:35] <wikibugs>	 (CR) Clément Goubert: [C: +1] scap.cfg: enable image building in production cluster [puppet] - https://gerrit.wikimedia.org/r/860023 (owner: Jaime Nuche)
[14:29:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:02] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[14:31:08] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] Add desired states for mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859979 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[14:32:21] <wikibugs>	 (CR) Btullis: [C: +1] "Thanks. Looks good to me. Will be on the lookout for any changes to behaviour, but don't anticipate anything." [puppet] - https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[14:32:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40769 and previous config saved to /var/cache/conftool/dbconfig/20221123-143228-marostegui.json
[14:32:49] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:33:30] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM,thanks." [puppet] - https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:36:25] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS bullseye
[14:36:31] <wikibugs>	 (CR) Marostegui: [C: +1] Add Cumin alias for orchestrator [puppet] - https://gerrit.wikimedia.org/r/857017 (owner: Muehlenhoff)
[14:36:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1045.eqiad.wmnet with OS bu...
[14:37:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40770 and previous config saved to /var/cache/conftool/dbconfig/20221123-143706-ladsgroup.json
[14:40:28] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[14:41:32] <moritzm>	 !log rebalance Ganeti group B/eqiad T311687
[14:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:38] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[14:43:42] <icinga-wm>	 RECOVERY - Ganeti memory on ganeti1015 is OK: OK Memory 83% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[14:44:43] <wikibugs>	 (CR) Vgutierrez: [C: +2] archiva: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859983 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[14:47:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321126)', diff saved to https://phabricator.wikimedia.org/P40771 and previous config saved to /var/cache/conftool/dbconfig/20221123-144735-marostegui.json
[14:47:42] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[14:49:41] <wikibugs>	 (PS1) Ssingh: hiera: lvs4007: bump bgp_med to 150 [puppet] - https://gerrit.wikimedia.org/r/860057
[14:49:58] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[14:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23
[14:52:11] <icinga-wm>	 ng
[14:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40772 and previous config saved to /var/cache/conftool/dbconfig/20221123-145212-ladsgroup.json
[14:54:09] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[14:54:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[14:54:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[14:54:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:54:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:54:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40773 and previous config saved to /var/cache/conftool/dbconfig/20221123-145446-marostegui.json
[14:54:52] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[14:56:18] <icinga-wm>	 RECOVERY - graphite.wikimedia.org requires authentication on graphite2004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 548 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:57:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40774 and previous config saved to /var/cache/conftool/dbconfig/20221123-145701-marostegui.json
[14:59:50] <wikibugs>	 (PS5) Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326)
[15:01:27] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[15:01:37] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert) In progress→Resolved
[15:01:45] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[15:03:45] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[15:06:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132 Maint', diff saved to https://phabricator.wikimedia.org/P40775 and previous config saved to /var/cache/conftool/dbconfig/20221123-150621-ladsgroup.json
[15:06:58] <wikibugs>	 (CR) Clément Goubert: [C: +2] Add new graphite hosts [deployment-charts] - https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) (owner: Filippo Giunchedi)
[15:07:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321312)', diff saved to https://phabricator.wikimedia.org/P40776 and previous config saved to /var/cache/conftool/dbconfig/20221123-150719-ladsgroup.json
[15:08:44] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:08:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:09:08] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:09:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:10:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[15:10:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[15:10:45] <claime>	 !log deploying change 859575 on mw-* wikikube deployments
[15:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:20] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[15:11:56] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:12:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40777 and previous config saved to /var/cache/conftool/dbconfig/20221123-151207-marostegui.json
[15:13:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:15:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P40778 and previous config saved to /var/cache/conftool/dbconfig/20221123-151507-ladsgroup.json
[15:15:18] <moritzm>	 !log updating snapshot* hosts to PHP 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 T323358
[15:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:45] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:17:51] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[15:20:10] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:20:53] <claime>	 godog: I deployed your graphite change to wikikube
[15:21:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] image-suggestion: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859485 (owner: Giuseppe Lavagetto)
[15:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:22:26] <_joe_>	 uh sigh
[15:25:20] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for Turnilo [puppet] - https://gerrit.wikimedia.org/r/860012 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:26:09] <wikibugs>	 (Merged) jenkins-bot: image-suggestion: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859485 (owner: Giuseppe Lavagetto)
[15:26:32] <James_F>	 jouncebot: now
[15:26:32] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 33 minute(s)
[15:26:35] <James_F>	 Ace
[15:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:27:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40779 and previous config saved to /var/cache/conftool/dbconfig/20221123-152714-marostegui.json
[15:28:48] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@52e4a00]: Deploying 52e4a00 for T311097 pointing Codex docs to latest
[15:28:54] <stashbot>	 T311097: docs: Consider making the latest release branch the default for the live docs site - https://phabricator.wikimedia.org/T311097
[15:29:03] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@52e4a00]: Deploying 52e4a00 for T311097 pointing Codex docs to latest (duration: 00m 14s)
[15:29:53] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[15:30:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P40780 and previous config saved to /var/cache/conftool/dbconfig/20221123-153012-ladsgroup.json
[15:30:16] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[15:31:05] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[15:35:55] <wikibugs>	 (PS1) Ssingh: lvs4010: commission new LVS host (ulsfo hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247)
[15:35:56] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[15:36:18] <godog>	 claime: amazing! thank you so much <3
[15:36:33] <claime>	 godog: np <3
[15:37:27] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: create fine-grained liftwing API definitions [deployment-charts] - https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[15:38:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40782 and previous config saved to /var/cache/conftool/dbconfig/20221123-153824-ladsgroup.json
[15:38:31] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:40:59] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: tools-webservice: add basic README file [software/tools-webservice] - https://gerrit.wikimedia.org/r/860069
[15:41:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[15:41:46] <logmsgbot>	 !log btullis@cumin2002 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[15:42:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:42:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321126)', diff saved to https://phabricator.wikimedia.org/P40783 and previous config saved to /var/cache/conftool/dbconfig/20221123-154220-marostegui.json
[15:42:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:42:27] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[15:42:30] <wikibugs>	 (Merged) jenkins-bot: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[15:42:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:42:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40784 and previous config saved to /var/cache/conftool/dbconfig/20221123-154242-marostegui.json
[15:42:44] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069
[15:44:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating for lvs4009 and lvs4010 - sukhe@cumin2002"
[15:44:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40785 and previous config saved to /var/cache/conftool/dbconfig/20221123-154459-marostegui.json
[15:45:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P40786 and previous config saved to /var/cache/conftool/dbconfig/20221123-154517-ladsgroup.json
[15:45:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating for lvs4009 and lvs4010 - sukhe@cumin2002"
[15:45:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:47:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[15:48:20] <wikibugs>	 (03PS28) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[15:48:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40787 and previous config saved to /var/cache/conftool/dbconfig/20221123-154831-ladsgroup.json
[15:48:38] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:49:28] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:50:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hiera: replace graphite2003 with 2004 for graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524)
[15:51:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[15:52:00] <wikibugs>	 (03CR) 10Muehlenhoff: install_server: Add dynamic raid configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[15:52:08] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38410/console" [puppet] - 10https://gerrit.wikimedia.org/r/860057 (owner: 10Ssingh)
[15:52:11] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MoritzMuehlenhoff) >>! In T308677#8346238, @jbond wrote: > The underlining is...
[15:52:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[15:52:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[15:52:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[15:53:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[15:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P40788 and previous config saved to /var/cache/conftool/dbconfig/20221123-155330-ladsgroup.json
[15:55:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38412/console" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:56:29] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] hiera: lvs4007: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/860057 (owner: 10Ssingh)
[15:57:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hiera: replace graphite2003 with 2004 for graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:58:05] <wikibugs>	 (03PS2) 10Jforrester: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053
[15:58:07] <wikibugs>	 (03CR) 10Jforrester: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 (owner: 10Jforrester)
[15:58:25] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] lvs4010: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[15:59:04] <wikibugs>	 (03PS3) 10Raymond Ndibe: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez)
[15:59:40] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:00:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40789 and previous config saved to /var/cache/conftool/dbconfig/20221123-160005-marostegui.json
[16:00:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P40790 and previous config saved to /var/cache/conftool/dbconfig/20221123-160022-ladsgroup.json
[16:01:57] <wikibugs>	 (03CR) 10Raymond Ndibe: [C: 03+2] tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez)
[16:02:42] <wikibugs>	 (03Merged) 10jenkins-bot: tools-webservice: add basic README file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/860069 (owner: 10Arturo Borrero Gonzalez)
[16:03:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[16:03:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) 05Resolved→03Open @BCornwall reopening I can not find myself on [[ https://ldap.toolforge.org/ | this list ]] and and not able to log into...
[16:03:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P40791 and previous config saved to /var/cache/conftool/dbconfig/20221123-160338-ladsgroup.json
[16:06:51] <wikibugs>	 (03CR) 10Jelto: "This change is ready for review." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[16:07:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[16:08:07] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[16:08:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:08:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P40792 and previous config saved to /var/cache/conftool/dbconfig/20221123-160837-ladsgroup.json
[16:09:00] <wikibugs>	 (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[16:09:19] <wikibugs>	 (03PS29) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[16:09:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[16:10:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[16:11:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Remove graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718)
[16:12:26] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] scap.cfg: enable image building in production cluster [puppet] - 10https://gerrit.wikimedia.org/r/860023 (owner: 10Jaime Nuche)
[16:13:07] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: lower memory limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196)
[16:13:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wmnet: replace graphite2003 with graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/860073 (https://phabricator.wikimedia.org/T315524)
[16:13:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:15:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40793 and previous config saved to /var/cache/conftool/dbconfig/20221123-161512-marostegui.json
[16:16:16] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[16:16:22] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:51] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[16:17:20] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[2001-2004].codfw.wmnet,aqs[1010-1015].eqiad.wmnet: T314309 restarting to pick up new JRE - eevans@cumin1001
[16:17:53] <wikibugs>	 (03PS3) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019
[16:17:55] <wikibugs>	 (03PS1) 10Jbond: systemd::timer::job: update documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/860074
[16:17:57] <wikibugs>	 (03PS1) 10Jbond: systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075
[16:18:17] <wikibugs>	 (03CR) 10Muehlenhoff: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[16:18:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P40794 and previous config saved to /var/cache/conftool/dbconfig/20221123-161844-ladsgroup.json
[16:20:02] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[16:21:55] <wikibugs>	 (03CR) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond)
[16:22:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "comment-only, self-merging" [dns] - 10https://gerrit.wikimedia.org/r/860073 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[16:23:31] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED
[16:23:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323214)', diff saved to https://phabricator.wikimedia.org/P40795 and previous config saved to /var/cache/conftool/dbconfig/20221123-162345-ladsgroup.json
[16:23:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:23:51] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:24:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:24:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40796 and previous config saved to /var/cache/conftool/dbconfig/20221123-162407-ladsgroup.json
[16:29:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, though I think we should change the default value to the link you posted:" [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:30:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:30:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321126)', diff saved to https://phabricator.wikimedia.org/P40797 and previous config saved to /var/cache/conftool/dbconfig/20221123-163018-marostegui.json
[16:30:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:30:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:30:25] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[16:30:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:30:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:30:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:31:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:31:13] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:31:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40798 and previous config saved to /var/cache/conftool/dbconfig/20221123-163115-marostegui.json
[16:33:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40799 and previous config saved to /var/cache/conftool/dbconfig/20221123-163330-marostegui.json
[16:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40800 and previous config saved to /var/cache/conftool/dbconfig/20221123-163351-ladsgroup.json
[16:33:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:33:57] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:34:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:34:11] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint1002']
[16:34:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40801 and previous config saved to /var/cache/conftool/dbconfig/20221123-163412-ladsgroup.json
[16:35:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul)
[16:35:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:37:17] <wikibugs>	 (03PS6) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[16:40:22] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: lower memory limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:40:26] <wikibugs>	 (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:40:46] <wikibugs>	 (03PS2) 10Jbond: systemd::timer::job: add monitoring_url to unit file [puppet] - 10https://gerrit.wikimedia.org/r/860075
[16:40:48] <wikibugs>	 (03PS1) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661)
[16:40:50] <wikibugs>	 (03PS1) 10Volans: WIP (to be modified) [cookbooks] - 10https://gerrit.wikimedia.org/r/860081
[16:42:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38413/console" [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:42:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[16:43:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[16:45:17] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: lower memory limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/860072 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:45:49] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[16:46:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[16:48:05] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10Jcross) Approved
[16:48:12] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Jcross) Approved
[16:48:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40802 and previous config saved to /var/cache/conftool/dbconfig/20221123-164837-marostegui.json
[16:49:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/860075 (owner: 10Jbond)
[16:49:58] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: proton: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859486
[16:51:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] spicerack: add monitoring for sre.puppet.netbox-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond)
[16:51:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:52:22] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:53:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[16:55:50] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:56:14] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['contint1002']
[16:56:52] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:57:24] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114
[16:57:52] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[16:58:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul)
[17:02:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10nskaggs) It's exciting to see so many successful transitions to single NIC here already! Great work! However, I also want to ask tha...
[17:03:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40803 and previous config saved to /var/cache/conftool/dbconfig/20221123-170343-marostegui.json
[17:09:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] proton: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859486 (owner: 10Giuseppe Lavagetto)
[17:12:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:13:28] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:14:37] <wikibugs>	 (03Merged) 10jenkins-bot: proton: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859486 (owner: 10Giuseppe Lavagetto)
[17:16:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for arclamp1001 - pt1979@cumin2002"
[17:18:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for arclamp1001 - pt1979@cumin2002"
[17:18:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:18:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321126)', diff saved to https://phabricator.wikimedia.org/P40804 and previous config saved to /var/cache/conftool/dbconfig/20221123-171850-marostegui.json
[17:18:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:18:56] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[17:19:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:19:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40805 and previous config saved to /var/cache/conftool/dbconfig/20221123-171911-marostegui.json
[17:21:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40806 and previous config saved to /var/cache/conftool/dbconfig/20221123-172128-marostegui.json
[17:21:39] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[17:22:09] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[17:24:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:27:11] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1054.mgmt.eqiad.wmnet with reboot policy FORCED
[17:32:12] <wikibugs>	 (03CR) 10Hashar: "Looking on gerrit1001.wikimedia.org in /var/log/apache2/gerrit.wikimedia.org.http.access.log there are only a few requests:" [puppet] - 10https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez)
[17:33:49] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[2001-2004].codfw.wmnet,aqs[1010-1015].eqiad.wmnet: T314309 restarting to pick up new JRE - eevans@cumin1001
[17:34:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:36:11] <wikibugs>	 (03CR) 10Dzahn: "ACK, I'm starting to feel this one causes more trouble than it fixes.hmmm" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[17:36:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40807 and previous config saved to /var/cache/conftool/dbconfig/20221123-173635-marostegui.json
[17:36:55] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:37:20] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:39:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:39:57] <urandom>	 !log initiating Cassandra bootstrap, aqs1018-a -- T307802
[17:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:03] <stashbot>	 T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802
[17:41:46] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.32.31:7001 on aqs1018 is OK: SSL OK - Certificate aqs1018-b valid until 2024-11-08 15:06:27 +0000 (expires in 715 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:42:20] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:42:33] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[17:42:41] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020
[17:42:47] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[17:44:06] <ryankemper>	 !log [Elastic] T319020 Kicked off rolling restart of cloudelastic to apply new heap size 8->10G; see `ryankemper@cumin1001` tmux session `cloudelastic_restarts`
[17:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40809 and previous config saved to /var/cache/conftool/dbconfig/20221123-175141-marostegui.json
[17:56:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40810 and previous config saved to /var/cache/conftool/dbconfig/20221123-175625-ladsgroup.json
[17:56:32] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:56:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:57:04] <wikibugs>	 (03PS15) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260
[17:58:08] <wikibugs>	 (03PS1) 10Reedy: Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236)
[17:58:13] <wikibugs>	 (03PS1) 10Hashar: eslint: switch to es2018 [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860086
[17:59:54] <wikibugs>	 (03CR) 10Hashar: "Gerrit has a few JavaScript plugins in ./plugins which are passed through eslint. I found out I could use the little bit modern es2018 whe" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860086 (owner: 10Hashar)
[18:00:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10MNadrofsky) @BTullis I approve this for @gmodena . With Will currently away, I'm acting manager for Gabriele. Let me know if you need anything else!
[18:00:44] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1055.mgmt.eqiad.wmnet with reboot policy FORCED
[18:01:25] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED
[18:01:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:01:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul)
[18:02:59] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[18:03:01] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[18:03:11] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[18:04:05] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[18:06:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321126)', diff saved to https://phabricator.wikimedia.org/P40812 and previous config saved to /var/cache/conftool/dbconfig/20221123-180648-marostegui.json
[18:06:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[18:06:55] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:07:01] <wikibugs>	 (03PS2) 10Hashar: eslint: switch to es2018 [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860086
[18:07:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[18:07:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40813 and previous config saved to /var/cache/conftool/dbconfig/20221123-180709-marostegui.json
[18:07:59] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[18:08:32] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[18:09:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40814 and previous config saved to /var/cache/conftool/dbconfig/20221123-180924-marostegui.json
[18:10:43] <wikibugs>	 (CR) Hashar: eslint: switch to es2018 (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/860086 (owner: Hashar)
[18:11:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P40815 and previous config saved to /var/cache/conftool/dbconfig/20221123-181132-ladsgroup.json
[18:12:17] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020
[18:12:22] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[18:16:37] <wikibugs>	 SRE, ops-eqiad, Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (Jclark-ctr) Open→Resolved Replaced cable Error has cleared
[18:17:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:18:02] <wikibugs>	 (03PS7) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[18:18:36] <icinga-wm>	 RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[18:22:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40816 and previous config saved to /var/cache/conftool/dbconfig/20221123-182220-ladsgroup.json
[18:22:27] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:22:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:24:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40817 and previous config saved to /var/cache/conftool/dbconfig/20221123-182431-marostegui.json
[18:26:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P40818 and previous config saved to /var/cache/conftool/dbconfig/20221123-182638-ladsgroup.json
[18:30:36] <wikibugs>	 (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond)
[18:36:24] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1056.mgmt.eqiad.wmnet with reboot policy FORCED
[18:37:05] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED
[18:37:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P40819 and previous config saved to /var/cache/conftool/dbconfig/20221123-183726-ladsgroup.json
[18:38:26] <wikibugs>	 (CR) Jbond: install_server: Add dynamic raid configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[18:38:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[18:38:51] <wikibugs>	 (03PS5) 10Vlad.shapik: WP:Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959)
[18:39:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:39:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40820 and previous config saved to /var/cache/conftool/dbconfig/20221123-183937-marostegui.json
[18:39:40] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs4007: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/860057 (owner: 10Ssingh)
[18:41:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host arclamp1001.mgmt.eqiad.wmnet with reboot policy FORCED
[18:41:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323214)', diff saved to https://phabricator.wikimedia.org/P40821 and previous config saved to /var/cache/conftool/dbconfig/20221123-184145-ladsgroup.json
[18:41:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:41:51] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:42:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:42:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40822 and previous config saved to /var/cache/conftool/dbconfig/20221123-184207-ladsgroup.json
[18:42:23] <sukhe>	 !log restart pybal on lvs4007.ulsfo.wmnet
[18:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:46] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add lvs4010 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/860089 (https://phabricator.wikimedia.org/T317247)
[18:44:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:44:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs4010: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/860067 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[18:44:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:45:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS buster
[18:45:33] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster
[18:51:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host arclamp1001.mgmt.eqiad.wmnet with reboot policy FORCED
[18:52:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P40823 and previous config saved to /var/cache/conftool/dbconfig/20221123-185233-ladsgroup.json
[18:53:05] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[18:53:11] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[18:54:17] <wikibugs>	 (03PS1) 10Papaul: Add contint1002 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860093 (https://phabricator.wikimedia.org/T313830)
[18:54:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321126)', diff saved to https://phabricator.wikimedia.org/P40824 and previous config saved to /var/cache/conftool/dbconfig/20221123-185444-marostegui.json
[18:54:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:54:50] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:54:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:55:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['arclamp1001']
[18:55:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40825 and previous config saved to /var/cache/conftool/dbconfig/20221123-185505-marostegui.json
[18:56:11] <logmsgbot>	 !log btullis@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[18:58:02] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "compiles now with no diff:) https://puppet-compiler.wmflabs.org/output/852260/38414/" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[18:59:07] <wikibugs>	 (03CR) 10Dzahn: "Would be great if this could be done with approval of serviceops-core team." [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond)
[18:59:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40826 and previous config saved to /var/cache/conftool/dbconfig/20221123-185920-marostegui.json
[18:59:38] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "to be merged during next migration window" [puppet] - 10https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[18:59:41] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add contint1002 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860093 (https://phabricator.wikimedia.org/T313830) (owner: 10Papaul)
[19:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T1900)
[19:02:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab, 10Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Dzahn) >>! In T313830#8366381, @hashar wrote: > Given this task to replace contint1001, its IPv4 address can be reclaimed once the migration has complete...
[19:03:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[19:03:54] <wikibugs>	 (03PS1) 10Ssingh: lvs4010: set as secondary LVS and remove lvs4007 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/860094 (https://phabricator.wikimedia.org/T317247)
[19:04:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab, 10Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul)
[19:04:40] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1057.mgmt.eqiad.wmnet with reboot policy FORCED
[19:05:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['arclamp1001']
[19:05:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[19:06:17] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED
[19:06:44] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:06:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Papaul)
[19:07:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[19:07:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40827 and previous config saved to /var/cache/conftool/dbconfig/20221123-190739-ladsgroup.json
[19:07:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:07:46] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:07:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:08:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40828 and previous config saved to /var/cache/conftool/dbconfig/20221123-190812-ladsgroup.json
[19:09:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[19:09:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host contint1002.wikimedia.org with OS buster
[19:09:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab, 10Patch-For-Review: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster
[19:11:52] <wikibugs>	 (03PS1) 10Jdlrobson: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T322041)
[19:13:21] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye
[19:13:27] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[19:14:09] <wikibugs>	 (03PS2) 10Jdlrobson: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722)
[19:14:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40829 and previous config saved to /var/cache/conftool/dbconfig/20221123-191427-marostegui.json
[19:15:27] <wikibugs>	 (03PS1) 10Papaul: Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330)
[19:16:03] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[19:16:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) (owner: 10Papaul)
[19:16:10] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[19:18:39] <wikibugs>	 (03PS2) 10Papaul: Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330)
[19:20:12] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add arclam1001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/860098 (https://phabricator.wikimedia.org/T3194330) (owner: 10Papaul)
[19:21:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1002.wikimedia.org with reason: host reimage
[19:24:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1002.wikimedia.org with reason: host reimage
[19:26:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS buster
[19:26:34] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster completed: - lvs4010 (**...
[19:28:04] <wikibugs>	 (CR) Ryan Kemper: elastic: change java GC options to default for ES7 (6 comments) [puppet] - https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: Bking)
[19:28:21] <wikibugs>	 (03CR) 10Ryan Kemper: "(Had forgotten to publish draft comments so just published them)" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking)
[19:29:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye
[19:29:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[19:29:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40830 and previous config saved to /var/cache/conftool/dbconfig/20221123-192934-marostegui.json
[19:32:49] <wikibugs>	 (03PS1) 10Jbond: install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101
[19:33:26] <wikibugs>	 (03PS1) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102
[19:34:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[19:34:55] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1058.mgmt.eqiad.wmnet with reboot policy FORCED
[19:35:11] <wikibugs>	 (03PS2) 10Jbond: install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101
[19:35:46] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED
[19:35:47] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs4010 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/860089 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[19:37:08] <sukhe>	 !log running homer for Gerrit: 860089
[19:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:47] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED
[19:37:53] <mutante>	 !log phab1004 - re-enabling puppet - phd should stay stopped, dumps and logmail should keep running
[19:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:33] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED
[19:38:59] <sukhe>	 !log [done] running homer for Gerrit: 860089
[19:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1002.wikimedia.org with OS buster
[19:39:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster completed: - contint1002 (**PASS**)   -...
[19:41:08] <sukhe>	 !log decommission lvs4007: T317247
[19:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:14] <stashbot>	 T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247
[19:41:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul)
[19:41:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs4007.ulsfo.wmnet
[19:42:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul) Open→Resolved @LSobanski this is done
[19:42:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: use cut instead of awk [puppet] - 10https://gerrit.wikimedia.org/r/860101 (owner: 10Jbond)
[19:43:17] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye
[19:43:24] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[19:44:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321126)', diff saved to https://phabricator.wikimedia.org/P40831 and previous config saved to /var/cache/conftool/dbconfig/20221123-194441-marostegui.json
[19:44:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:44:47] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[19:44:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:45:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance
[19:45:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance
[19:45:32] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:45:38] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:45:46] <sukhe>	 ^ expected
[19:45:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance
[19:45:49] <herron>	 ack
[19:45:55] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[19:46:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance
[19:46:01] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[19:46:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance
[19:46:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance
[19:46:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40832 and previous config saved to /var/cache/conftool/dbconfig/20221123-194646-marostegui.json
[19:46:52] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs4007 [homer/public] - 10https://gerrit.wikimedia.org/r/860103 (https://phabricator.wikimedia.org/T317247)
[19:48:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:49:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40833 and previous config saved to /var/cache/conftool/dbconfig/20221123-194918-marostegui.json
[19:49:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10LSobanski)
[19:49:52] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski) 05Stalled→03Open a:05LSobanski→03None
[19:51:09] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10Dzahn) also see T313830#8418218
[19:51:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs4007.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[19:52:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Dzahn) If this is done, I assume the IP addresses can't have stayed the same as @Hashar was asking.  But given that netbox will assign one automatically that was probably neve...
[19:52:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:54:35] <sukhe>	 I have some contint changes in the dns cookbook
[19:54:41] <sukhe>	 is it OK to merge those?
[19:54:49] <sukhe>	 mutante: ^ since I saw your comment on the contint thing, I think :)
[19:54:57] <mutante>	 sukhe: no, it's not me :)
[19:55:01] <sukhe>	 oh sorry
[19:55:16] <mutante>	 I was just wondering about related stuff
[19:55:19] <sukhe>	 papaul: ^
[19:55:27] <sukhe>	 it was papaul, I just saw your last comment on IRC and assumed it was you :P
[19:59:19] <wikibugs>	 (03PS2) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102
[19:59:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs4007.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[19:59:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:59:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs4007.ulsfo.wmnet
[19:59:45] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4007.ulsfo.wmnet` - lvs4007.ulsfo.wmnet (**WARN**)   - D...
[19:59:55] <wikibugs>	 (03CR) 10jenkins-bot: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[20:00:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs4007 [homer/public] - 10https://gerrit.wikimedia.org/r/860103 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[20:01:06] <wikibugs>	 (03PS3) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102
[20:01:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[20:02:19] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye
[20:02:25] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[20:02:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:03:19] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[20:03:26] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[20:03:39] <sukhe>	 !log running homer for Gerrit: 860103
[20:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:15] <wikibugs>	 (03PS4) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102
[20:04:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40835 and previous config saved to /var/cache/conftool/dbconfig/20221123-200424-marostegui.json
[20:05:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs4010: set as secondary LVS and remove lvs4007 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/860094 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[20:06:02] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for phab1004.eqiad.wmnet
[20:06:03] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for phab1004.eqiad.wmnet
[20:06:21] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ssingh)
[20:07:26] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1059.mgmt.eqiad.wmnet with reboot policy FORCED
[20:07:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[20:08:57] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED
[20:11:28] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/860071 (https://phabricator.wikimedia.org/T323718) (owner: 10Filippo Giunchedi)
[20:14:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40836 and previous config saved to /var/cache/conftool/dbconfig/20221123-201407-ladsgroup.json
[20:14:14] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:19:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40837 and previous config saved to /var/cache/conftool/dbconfig/20221123-201931-marostegui.json
[20:20:06] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED
[20:20:29] <wikibugs>	 (03PS3) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[20:20:34] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED
[20:20:55] <wikibugs>	 (03PS4) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[20:21:43] <wikibugs>	 (03PS5) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[20:21:57] <wikibugs>	 (03PS6) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[20:29:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P40838 and previous config saved to /var/cache/conftool/dbconfig/20221123-202914-ladsgroup.json
[20:34:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321126)', diff saved to https://phabricator.wikimedia.org/P40839 and previous config saved to /var/cache/conftool/dbconfig/20221123-203437-marostegui.json
[20:34:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:34:44] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[20:34:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:35:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40840 and previous config saved to /var/cache/conftool/dbconfig/20221123-203459-marostegui.json
[20:37:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40841 and previous config saved to /var/cache/conftool/dbconfig/20221123-203731-marostegui.json
[20:38:10] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye
[20:38:17] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[20:40:56] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[20:41:03] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[20:41:44] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host arclamp1001.eqiad.wmnet with OS bullseye
[20:41:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001...
[20:44:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P40842 and previous config saved to /var/cache/conftool/dbconfig/20221123-204420-ladsgroup.json
[20:46:01] <wikibugs>	 10SRE, 10Tracking-Neverending: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10Aklapper)
[20:48:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40843 and previous config saved to /var/cache/conftool/dbconfig/20221123-204816-ladsgroup.json
[20:48:23] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:50:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Papaul) @Dzahn yes the server has a Public IP address
[20:52:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye
[20:52:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[20:52:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40844 and previous config saved to /var/cache/conftool/dbconfig/20221123-205238-marostegui.json
[20:56:14] * TheresNoTime is going to be unavailable for deploy this evening
[20:56:36] <wikibugs>	 (03PS5) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102
[20:57:55] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be2050.codfw.wmnet with OS bullseye
[20:58:01] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[20:59:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[20:59:17] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[20:59:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40845 and previous config saved to /var/cache/conftool/dbconfig/20221123-205926-ladsgroup.json
[20:59:27] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[20:59:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[20:59:32] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:59:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221123T2100).
[21:00:05] <jouncebot>	 cirno and jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:17] <jan_drewniak>	 o/
[21:00:26] <cjming>	 hi ! i can deploy
[21:01:18] <cirno>	 o/
[21:01:48] <cjming>	 cirno: i'll start with your patch
[21:02:11] <cjming>	 jan_drewniak: nice to see you here :) do you want to self-deploy after i finish the current patch?
[21:03:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P40846 and previous config saved to /var/cache/conftool/dbconfig/20221123-210322-ladsgroup.json
[21:04:00] <jan_drewniak>	 cjming: I haven't done one of these in a while, is it basically these instructions? https://deploy-commands.toolforge.org/bacc/860096 
[21:04:33] <cjming>	 jan_drewniak: ya - i'm also happy to do it for you if you prefer
[21:04:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:05:26] <wikibugs>	 (03Merged) 10jenkins-bot: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:05:41] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]]
[21:05:47] <stashbot>	 T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627
[21:06:13] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1060.mgmt.eqiad.wmnet with reboot policy FORCED
[21:06:47] <cirno>	 cjming: I don't think this patch could be tested, as beta cluster is not supported by WikimediaDebug, so can we sync directly?
[21:07:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40848 and previous config saved to /var/cache/conftool/dbconfig/20221123-210744-marostegui.json
[21:08:39] <logmsgbot>	 !log cjming@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. (duration: 02m 57s)
[21:10:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:10:28] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]]
[21:10:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:10:38] <jan_drewniak>	 cjming: actually this scap backport command is new to me, so I wanna try it out :P
[21:11:11] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED
[21:11:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:11:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:11:40] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED
[21:12:09] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED
[21:12:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:13:55] <cjming>	 jan_drewniak: sounds good ! i'll let you know when i'm done with cirno's patch
[21:14:17] <cjming>	 cirno: apologies - i'm having some issues with my account on the deployment server - just need a few mins to troubleshoot
[21:16:52] <logmsgbot>	 !log cjming@deploy1002 sync-world aborted: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] (duration: 06m 24s)
[21:16:53] <logmsgbot>	 !log cjming@deploy1002 backport aborted:  (duration: 06m 39s)
[21:16:58] <stashbot>	 T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627
[21:17:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:18:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:18:14] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]]
[21:18:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P40849 and previous config saved to /var/cache/conftool/dbconfig/20221123-211829-ladsgroup.json
[21:19:33] <logmsgbot>	 !log brennen@deploy1002 brennen and stang: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:20:28] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "LGTM: balance in the spreadsheet looks good, and the CR matches the spreadsheet." [puppet] - 10https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto)
[21:22:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul)
[21:22:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:22:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321126)', diff saved to https://phabricator.wikimedia.org/P40850 and previous config saved to /var/cache/conftool/dbconfig/20221123-212250-marostegui.json
[21:22:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance
[21:22:57] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[21:23:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance
[21:23:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:23:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:23:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40851 and previous config saved to /var/cache/conftool/dbconfig/20221123-212312-marostegui.json
[21:23:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:24:43] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:859510|Update favicon and CentralAuthLoginIcon for wikifunctionswiki (T323627)]] (duration: 06m 29s)
[21:24:49] <stashbot>	 T323627: Update favicon and CentralAuthLoginIcon for wikifunctionswiki - https://phabricator.wikimedia.org/T323627
[21:25:04] <cjming>	 jan_drewniak: feel free to try your patch -- i'm curious if you get prompted for a sudo pw when you sync
[21:25:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40852 and previous config saved to /var/cache/conftool/dbconfig/20221123-212543-marostegui.json
[21:25:56] <cjming>	 i got stalled by prod and need to file a ticket for my account
[21:26:37] <cjming>	 cirno: your patch is live - purging files now
[21:28:12] <jan_drewniak>	 cjming: ok I'm giving it a shot
[21:28:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson)
[21:29:40] <wikibugs>	 (03Merged) 10jenkins-bot: Update ky wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson)
[21:29:53] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]]
[21:29:59] <stashbot>	 T323722: Deploy new logo to Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T323722
[21:30:33] <jan_drewniak>	 cjming: yup, looks like I'm prompted for a sudo password. Not sure what to do there... 
[21:30:59] <brennen>	 jan_drewniak: yeah, same glitch cjming was running into, i can go ahead and run it since it seems to work for me
[21:31:14] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be2050.codfw.wmnet with OS bullseye
[21:31:18] <cjming>	 jan_drewniak: gtk - i'll file a ticket and include that you got this prompt too
[21:31:20] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[21:31:26] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1061.mgmt.eqiad.wmnet with reboot policy FORCED
[21:31:29] <jan_drewniak>	 brennen: thanks that'd be great! 
[21:31:32] <logmsgbot>	 !log jdrewniak@deploy1002 sync-world aborted: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] (duration: 01m 38s)
[21:31:32] <logmsgbot>	 !log jdrewniak@deploy1002 backport aborted:  (duration: 02m 40s)
[21:31:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860096 (https://phabricator.wikimedia.org/T323722) (owner: 10Jdlrobson)
[21:31:49] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[21:31:55] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[21:31:56] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]]
[21:32:11] <jan_drewniak>	 I mean it does say `21:29:56 Running sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release` so... 
[21:32:43] <cjming>	 cirno: brennen sync'd and purged your files - should be live
[21:32:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:33:15] <logmsgbot>	 !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:33:17] <cirno>	 cjming: confirmed, thanks
[21:33:20] <brennen>	 jan_drewniak: yeah, i think there's probably just a mismatch on group membership or something here, we'll dig in a bit
[21:33:33] <brennen>	 jan_drewniak: on text boxen...
[21:33:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323214)', diff saved to https://phabricator.wikimedia.org/P40853 and previous config saved to /var/cache/conftool/dbconfig/20221123-213335-ladsgroup.json
[21:33:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[21:33:42] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[21:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[21:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40854 and previous config saved to /var/cache/conftool/dbconfig/20221123-213357-ladsgroup.json
[21:34:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:34:07] <jan_drewniak>	 brennen: ok, I see the change on mwdebug1002, looks good to sync
[21:34:10] <wikibugs>	 (03PS1) 10Jbond: install_server: fix config for ms-be dynamic partition [puppet] - 10https://gerrit.wikimedia.org/r/860114
[21:34:18] <brennen>	 cool, going ahead
[21:34:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:34:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:35:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:35:40] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] conftool: add the new servers [puppet] - 10https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto)
[21:35:52] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054']
[21:38:13] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:860096|Update ky wikipedia logo (T323722)]] (duration: 06m 17s)
[21:38:19] <stashbot>	 T323722: Deploy new logo to Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T323722
[21:38:24] <brennen>	 jan_drewniak: {{done}}
[21:38:34] <jan_drewniak>	 brennen: thanks!
[21:38:38] <brennen>	 !log end of utc late backport and config window
[21:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:43] <wikibugs>	 (03PS1) 10Stang: wikidatawiki: Add language-specific logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860117 (https://phabricator.wikimedia.org/T323734)
[21:39:42] <cjming>	 jan_drewniak: if you want to add any other details - i mentioned you on the ticket https://phabricator.wikimedia.org/T323735
[21:40:19] <jan_drewniak>	 cjming: thanks! I think that pretty much sums it up :) 
[21:40:31] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236) (owner: 10Reedy)
[21:40:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40855 and previous config saved to /var/cache/conftool/dbconfig/20221123-214050-marostegui.json
[21:44:00] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054']
[21:44:15] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054']
[21:45:01] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054']
[21:46:27] <wikibugs>	 (PS1) Dzahn: Revert "Revert "hieradata: switch active Phabricator server to phab1004"" [puppet] - https://gerrit.wikimedia.org/r/860031
[21:46:41] <wikibugs>	 (PS1) Dzahn: Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - https://gerrit.wikimedia.org/r/860032
[21:47:35] <wikibugs>	 (CR) CI reject: [V: -1] Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - https://gerrit.wikimedia.org/r/860032 (owner: Dzahn)
[21:48:22] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye
[21:48:29] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[21:48:40] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye
[21:48:47] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond...
[21:54:32] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host arclamp1001.eqiad.wmnet with OS bullseye
[21:54:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001...
[21:54:41] <wikibugs>	 (Merged) jenkins-bot: Partial Revert "Remove pre PHP 7.4 serialize()/unserialize()" [core] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/860030 (https://phabricator.wikimedia.org/T323236) (owner: Reedy)
[21:55:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:55:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40857 and previous config saved to /var/cache/conftool/dbconfig/20221123-215557-marostegui.json
[21:56:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:56:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:57:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:59:50] <logmsgbot>	 !log reedy@deploy1002 Synchronized php-1.40.0-wmf.10/includes/language/Message.php: T323236 (duration: 04m 35s)
[21:59:56] <stashbot>	 T323236: PHP Warning: Class RawMessage has no unserializer - https://phabricator.wikimedia.org/T323236
[22:02:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[22:02:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[22:02:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[22:03:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[22:11:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321126)', diff saved to https://phabricator.wikimedia.org/P40858 and previous config saved to /var/cache/conftool/dbconfig/20221123-221103-marostegui.json
[22:11:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance
[22:11:10] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[22:11:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance
[22:11:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40859 and previous config saved to /var/cache/conftool/dbconfig/20221123-221125-marostegui.json
[22:13:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40860 and previous config saved to /var/cache/conftool/dbconfig/20221123-221356-marostegui.json
[22:21:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40861 and previous config saved to /var/cache/conftool/dbconfig/20221123-222105-ladsgroup.json
[22:21:12] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:22:17] <wikibugs>	 (PS1) Brennen Bearnes: sudo: add update-mediawiki-tools release to deployers [puppet] - https://gerrit.wikimedia.org/r/860121 (https://phabricator.wikimedia.org/T323735)
[22:25:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[22:25:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:26:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[22:26:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[22:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[22:26:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40862 and previous config saved to /var/cache/conftool/dbconfig/20221123-222627-ladsgroup.json
[22:26:33] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:29:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40864 and previous config saved to /var/cache/conftool/dbconfig/20221123-222903-marostegui.json
[22:30:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:32:30] <urandom>	 Well damn, I hope I didn't just break deployment-prep; I hope nothing relied on the sessionstore instance there
[22:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P40865 and previous config saved to /var/cache/conftool/dbconfig/20221123-223611-ladsgroup.json
[22:40:52] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2050.codfw.wmnet with OS bullseye
[22:40:59] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum...
[22:44:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40866 and previous config saved to /var/cache/conftool/dbconfig/20221123-224409-marostegui.json
[22:51:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P40868 and previous config saved to /var/cache/conftool/dbconfig/20221123-225118-ladsgroup.json
[22:59:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321126)', diff saved to https://phabricator.wikimedia.org/P40869 and previous config saved to /var/cache/conftool/dbconfig/20221123-225916-marostegui.json
[22:59:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance
[22:59:23] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[22:59:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance
[22:59:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40870 and previous config saved to /var/cache/conftool/dbconfig/20221123-225937-marostegui.json
[23:02:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40871 and previous config saved to /var/cache/conftool/dbconfig/20221123-230209-marostegui.json
[23:06:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323214)', diff saved to https://phabricator.wikimedia.org/P40872 and previous config saved to /var/cache/conftool/dbconfig/20221123-230624-ladsgroup.json
[23:06:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[23:06:32] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[23:06:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[23:17:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40874 and previous config saved to /var/cache/conftool/dbconfig/20221123-231716-marostegui.json
[23:20:52] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (andrea.denisse)
[23:22:25] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (andrea.denisse) The request checklist for access is completed.  I think we can merge patch [[ https://gerrit.wikimedia.org/r/c/854952 | #854952 ]]...
[23:23:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (andrea.denisse)
[23:23:59] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "Approving because the access checklist is completed." [puppet] - https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) (owner: Filippo Giunchedi)
[23:25:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (andrea.denisse)
[23:32:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40875 and previous config saved to /var/cache/conftool/dbconfig/20221123-233222-marostegui.json
[23:32:55] <wikibugs>	 (PS1) Ebernhardson: mjolnir msearch: Reduce allowed concurrency [puppet] - https://gerrit.wikimedia.org/r/860129 (https://phabricator.wikimedia.org/T318575)
[23:47:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321126)', diff saved to https://phabricator.wikimedia.org/P40876 and previous config saved to /var/cache/conftool/dbconfig/20221123-234729-marostegui.json
[23:47:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance
[23:47:36] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[23:47:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance
[23:47:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance
[23:48:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance
[23:48:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T321126)', diff saved to https://phabricator.wikimedia.org/P40877 and previous config saved to /var/cache/conftool/dbconfig/20221123-234806-marostegui.json
[23:50:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321126)', diff saved to https://phabricator.wikimedia.org/P40878 and previous config saved to /var/cache/conftool/dbconfig/20221123-235037-marostegui.json
[23:59:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40879 and previous config saved to /var/cache/conftool/dbconfig/20221123-235928-ladsgroup.json
[23:59:35] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214