[00:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:22:43] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:43:16] <legoktm>	 !issync
[00:43:17] <ircservserv-wm>	 Syncing #wikimedia-operations (requested by legoktm)
[00:43:18] <ircservserv-wm>	 Set /cs flags #wikimedia-operations Majavah +Aiotv
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:09] <icinga-wm>	 RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:29:25] <wikibugs>	 (PS1) Marostegui: db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837465
[05:29:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P35244 and previous config saved to /var/cache/conftool/dbconfig/20221003-052927-root.json
[05:32:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837465 (owner: Marostegui)
[05:42:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35245 and previous config saved to /var/cache/conftool/dbconfig/20221003-054206-root.json
[05:42:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1167', diff saved to https://phabricator.wikimedia.org/P35246 and previous config saved to /var/cache/conftool/dbconfig/20221003-054245-root.json
[05:50:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35247 and previous config saved to /var/cache/conftool/dbconfig/20221003-055052-root.json
[05:51:08] <wikibugs>	 (PS1) Marostegui: Revert "db1179: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837001
[05:53:32] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1179: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837001 (owner: Marostegui)
[05:54:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P35248 and previous config saved to /var/cache/conftool/dbconfig/20221003-055401-root.json
[05:57:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35249 and previous config saved to /var/cache/conftool/dbconfig/20221003-055711-root.json
[06:04:52] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15133
[06:05:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35250 and previous config saved to /var/cache/conftool/dbconfig/20221003-060557-root.json
[06:07:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15133
[06:12:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35251 and previous config saved to /var/cache/conftool/dbconfig/20221003-061216-root.json
[06:13:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3300
[06:15:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3300
[06:20:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35252 and previous config saved to /var/cache/conftool/dbconfig/20221003-062022-root.json
[06:20:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:21:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35253 and previous config saved to /var/cache/conftool/dbconfig/20221003-062102-root.json
[06:25:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:26:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5400
[06:27:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5400
[06:27:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35254 and previous config saved to /var/cache/conftool/dbconfig/20221003-062721-root.json
[06:30:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11039
[06:30:30] <wikibugs>	 (PS1) Marostegui: mariadb: Remove innodb_large_prefix flag. [puppet] - https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879)
[06:30:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11039
[06:33:08] <wikibugs>	 (PS2) Marostegui: mariadb: Remove innodb_large_prefix flag. [puppet] - https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879)
[06:35:01] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:35:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35255 and previous config saved to /var/cache/conftool/dbconfig/20221003-063527-root.json
[06:36:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35256 and previous config saved to /var/cache/conftool/dbconfig/20221003-063607-root.json
[06:42:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35257 and previous config saved to /var/cache/conftool/dbconfig/20221003-064226-root.json
[06:46:03] <wikibugs>	 (PS1) Marostegui: db1182: Upgrade from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837491 (https://phabricator.wikimedia.org/T301879)
[06:46:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P35258 and previous config saved to /var/cache/conftool/dbconfig/20221003-064638-root.json
[06:47:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1182: Upgrade from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837491 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[06:48:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 6128
[06:50:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35259 and previous config saved to /var/cache/conftool/dbconfig/20221003-065031-root.json
[06:51:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35260 and previous config saved to /var/cache/conftool/dbconfig/20221003-065112-root.json
[06:51:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35261 and previous config saved to /var/cache/conftool/dbconfig/20221003-065154-root.json
[06:52:00] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6128
[06:56:06] <wikibugs>	 (PS1) Marostegui: mariadb: Remove semi_sync plugin [puppet] - https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914)
[06:57:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35262 and previous config saved to /var/cache/conftool/dbconfig/20221003-065731-root.json
[06:58:17] <wikibugs>	 (PS1) Marostegui: db2175: Upgrade mariadb to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837493 (https://phabricator.wikimedia.org/T318914)
[06:58:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2175', diff saved to https://phabricator.wikimedia.org/P35263 and previous config saved to /var/cache/conftool/dbconfig/20221003-065844-root.json
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:13] <wikibugs>	 (CR) Marostegui: [C: +2] db2175: Upgrade mariadb to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837493 (https://phabricator.wikimedia.org/T318914) (owner: Marostegui)
[07:01:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:03:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:04:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35264 and previous config saved to /var/cache/conftool/dbconfig/20221003-070431-root.json
[07:05:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35265 and previous config saved to /var/cache/conftool/dbconfig/20221003-070536-root.json
[07:06:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35266 and previous config saved to /var/cache/conftool/dbconfig/20221003-070617-root.json
[07:07:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35267 and previous config saved to /var/cache/conftool/dbconfig/20221003-070659-root.json
[07:12:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35268 and previous config saved to /var/cache/conftool/dbconfig/20221003-071236-root.json
[07:16:53] <wikibugs>	 (PS1) Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[07:19:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35269 and previous config saved to /var/cache/conftool/dbconfig/20221003-071936-root.json
[07:20:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35270 and previous config saved to /var/cache/conftool/dbconfig/20221003-072041-root.json
[07:21:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35271 and previous config saved to /var/cache/conftool/dbconfig/20221003-072122-root.json
[07:22:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35272 and previous config saved to /var/cache/conftool/dbconfig/20221003-072204-root.json
[07:26:07] <wikibugs>	 (CR) Marostegui: [C: +1] "So this didn't work even with binlog disabled?" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[07:27:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35273 and previous config saved to /var/cache/conftool/dbconfig/20221003-072741-root.json
[07:34:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35274 and previous config saved to /var/cache/conftool/dbconfig/20221003-073441-root.json
[07:35:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35275 and previous config saved to /var/cache/conftool/dbconfig/20221003-073546-root.json
[07:35:54] <wikibugs>	 (PS1) Marostegui: db1120: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837497 (https://phabricator.wikimedia.org/T301879)
[07:35:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1200', diff saved to https://phabricator.wikimedia.org/P35276 and previous config saved to /var/cache/conftool/dbconfig/20221003-073556-root.json
[07:36:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35277 and previous config saved to /var/cache/conftool/dbconfig/20221003-073627-root.json
[07:36:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1200.eqiad.wmnet with reason: Upgrade to 10.6
[07:36:40] <XioNoX>	 !log cr2-drmrs# set chassis fpc 0 sampling-instance pmacct
[07:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1200.eqiad.wmnet with reason: Upgrade to 10.6
[07:36:58] <wikibugs>	 SRE, observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (fgiunchedi) >>! In T318911#8272789, @ssingh wrote:  > So to summarize, a short term fix can be to delete the misbehaving CT log (https://yeti2023.ct.digicert.com/log/). But a long term fix needs to include...
[07:37:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35278 and previous config saved to /var/cache/conftool/dbconfig/20221003-073709-root.json
[07:37:49] <wikibugs>	 (CR) Marostegui: [C: +2] db1120: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837497 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:39:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35279 and previous config saved to /var/cache/conftool/dbconfig/20221003-073944-root.json
[07:42:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16637
[07:42:32] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16637
[07:49:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35280 and previous config saved to /var/cache/conftool/dbconfig/20221003-074946-root.json
[07:50:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35281 and previous config saved to /var/cache/conftool/dbconfig/20221003-075051-root.json
[07:51:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:52:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35282 and previous config saved to /var/cache/conftool/dbconfig/20221003-075214-root.json
[07:54:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35283 and previous config saved to /var/cache/conftool/dbconfig/20221003-075449-root.json
[07:56:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:56:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2178', diff saved to https://phabricator.wikimedia.org/P35284 and previous config saved to /var/cache/conftool/dbconfig/20221003-075643-root.json
[07:56:55] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:57:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[07:57:26] <wikibugs>	 (PS1) Marostegui: db2178: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837614 (https://phabricator.wikimedia.org/T301879)
[07:57:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[07:58:28] <wikibugs>	 (CR) Marostegui: [C: +2] db2178: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837614 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:59:50] <wikibugs>	 (CR) David Caro: [C: +2] flake8: Several pep8/flake8 fixes [puppet] - https://gerrit.wikimedia.org/r/837126 (owner: David Caro)
[08:00:13] <wikibugs>	 SRE: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (fgiunchedi)
[08:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:04:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35285 and previous config saved to /var/cache/conftool/dbconfig/20221003-080451-root.json
[08:05:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[08:05:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[08:05:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16509
[08:05:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35286 and previous config saved to /var/cache/conftool/dbconfig/20221003-080556-root.json
[08:05:59] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 16509
[08:07:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35287 and previous config saved to /var/cache/conftool/dbconfig/20221003-080719-root.json
[08:09:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35288 and previous config saved to /var/cache/conftool/dbconfig/20221003-080954-root.json
[08:12:42] <wikibugs>	 (PS1) Marostegui: db2178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837615
[08:13:18] <wikibugs>	 (CR) Marostegui: [C: +2] db2178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837615 (owner: Marostegui)
[08:16:49] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 39386
[08:19:28] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (dcaro)
[08:19:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35289 and previous config saved to /var/cache/conftool/dbconfig/20221003-081955-root.json
[08:21:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 39386
[08:22:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35290 and previous config saved to /var/cache/conftool/dbconfig/20221003-082224-root.json
[08:23:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30781
[08:24:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30781
[08:25:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35291 and previous config saved to /var/cache/conftool/dbconfig/20221003-082459-root.json
[08:26:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12975
[08:26:20] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12975
[08:26:59] <wikibugs>	 SRE, serviceops, Service-deployment-requests: New Service Request - Calculator Service - https://phabricator.wikimedia.org/T273807 (Joe) Open→Invalid Closing as invalid because I don't think we need this anymore
[08:28:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15557
[08:29:09] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15557
[08:29:27] <wikibugs>	 ops-eqsin, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez)
[08:30:00] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (Joe) p:Triage→High
[08:30:06] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) Open→Resolved >>! In T314256#8275011, @MoritzMuehlenhoff wrote: > Traffic folks, can be please go ahead and fully decom cp5001, then? Right now this is in a weird limb...
[08:30:42] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5001.eqsin.wmnet
[08:34:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12956
[08:35:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35292 and previous config saved to /var/cache/conftool/dbconfig/20221003-083502-root.json
[08:35:48] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12956
[08:35:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:18] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox
[08:37:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35293 and previous config saved to /var/cache/conftool/dbconfig/20221003-083729-root.json
[08:38:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3303
[08:39:40] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 3303
[08:40:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35294 and previous config saved to /var/cache/conftool/dbconfig/20221003-084004-root.json
[08:40:05] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:40:06] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5001.eqsin.wmnet
[08:40:09] <wikibugs>	 ops-eqsin, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp5001.eqsin.wmnet` - cp5001.eqsin.wmnet (**FAIL**)   - //Host not found on Icinga, unable to do...
[08:40:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:41:15] <wikibugs>	 (CR) David Caro: alerts.downtime_host: attempt to match alert hostnames with :<port> (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[08:48:04] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (dcaro)
[08:48:10] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (dcaro)
[08:48:54] <wikibugs>	 SRE, Pontoon: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (Aklapper)
[08:50:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35295 and previous config saved to /var/cache/conftool/dbconfig/20221003-085007-root.json
[08:51:40] <wikibugs>	 SRE, ops-eqsin, Traffic, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez) a:wiki_willy
[08:52:04] <wikibugs>	 SRE, ops-eqsin, Traffic, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez)
[08:53:12] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 12975
[08:54:15] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12975
[08:55:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35296 and previous config saved to /var/cache/conftool/dbconfig/20221003-085509-root.json
[08:58:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2157', diff saved to https://phabricator.wikimedia.org/P35297 and previous config saved to /var/cache/conftool/dbconfig/20221003-085840-root.json
[08:59:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db[2157,2178].codfw.wmnet with reason: Reclone
[08:59:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db[2157,2178].codfw.wmnet with reason: Reclone
[09:01:37] <wikibugs>	 (PS1) Marostegui: db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837619 (https://phabricator.wikimedia.org/T319169)
[09:02:17] <wikibugs>	 (CR) Marostegui: [C: +2] db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837619 (https://phabricator.wikimedia.org/T319169) (owner: Marostegui)
[09:02:34] <wikibugs>	 (PS4) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:10:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35299 and previous config saved to /var/cache/conftool/dbconfig/20221003-091014-root.json
[09:11:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62044
[09:11:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62044
[09:14:21] <wikibugs>	 (CR) Elukey: Update calico to v3.23.3 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:16:29] <wikibugs>	 (PS5) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:19:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:23] <wikibugs>	 (CR) Elukey: [C: +1] "Left some nits, you can freely skip them if they are not worth it. The changes look good, even if I don't have a lot of context in what ch" [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:20:12] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM (modulo the calico-specific changes, I didn't check all of them)." [deployment-charts] - https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:21:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31133
[09:22:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31133
[09:24:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35300 and previous config saved to /var/cache/conftool/dbconfig/20221003-092519-root.json
[09:28:54] <wikibugs>	 (PS6) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:30:24] <wikibugs>	 (CR) Vgutierrez: [C: +2] "looking good on https://grafana.wikimedia.org/dashboard/snapshot/0gMtUk3zjPMMopv9BCKSUxykeXHb1zXm?orgId=1" [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[09:30:57] <wikibugs>	 (CR) Vgutierrez: [V: +2 C: +2] Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[09:31:02] <wikibugs>	 (PS1) Elukey: Move kafka-logging2001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130)
[09:32:48] <wikibugs>	 SRE, Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez) Open→Resolved SLO dashboard available in https://grafana.wikimedia.org/d/slo-trafficserver-tmpl/trafficserver-slos-grizzly-template?orgId=1
[09:33:08] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37403/console" [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[09:33:22] <wikibugs>	 (PS2) Elukey: Move kafka-logging1001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130)
[09:33:30] <wikibugs>	 (CR) Elukey: Move kafka-logging1001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[09:34:33] <wikibugs>	 (PS3) Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405)
[09:36:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:41:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:45:18] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) (owner: Vgutierrez)
[09:47:56] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail:varnishsli: Track client sided requests only [puppet] - https://gerrit.wikimedia.org/r/834525 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[09:59:08] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:00:12] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1002.eqiad.wmnet with reason: Prep for reimage
[10:00:26] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1002.eqiad.wmnet with reason: Prep for reimage
[10:00:56] <hnowlan>	 !log c-foreach-nt drain on sessionstore1002
[10:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:02] <wikibugs>	 (PS2) AOkoth: vrts: enable vrts-daemon on WMCS instance [puppet] - https://gerrit.wikimedia.org/r/834510 (https://phabricator.wikimedia.org/T317059)
[10:05:02] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1002.eqiad.wmnet with OS buster
[10:07:43] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - https://gerrit.wikimedia.org/r/837630
[10:07:45] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284)
[10:08:30] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - https://gerrit.wikimedia.org/r/837630
[10:08:32] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284)
[10:10:50] <wikibugs>	 (CR) CI reject: [V: -1] interface: factorize interface renaming function [puppet] - https://gerrit.wikimedia.org/r/837630 (owner: Arturo Borrero Gonzalez)
[10:12:33] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - https://gerrit.wikimedia.org/r/837630
[10:12:35] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284)
[10:16:42] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage
[10:19:25] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage
[10:23:09] <wikibugs>	 (CR) Jcrespo: [C: +1] "This is ok to me, but probably btullis should ok it too." [puppet] - https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[10:24:38] <wikibugs>	 (CR) Jcrespo: [C: +1] mariadb: Remove semi_sync plugin [puppet] - https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914) (owner: Marostegui)
[10:25:04] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/37406/" [puppet] - https://gerrit.wikimedia.org/r/837630 (owner: Arturo Borrero Gonzalez)
[10:26:56] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/37407/" [puppet] - https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: Arturo Borrero Gonzalez)
[10:27:01] <wikibugs>	 (CR) Jcrespo: "The patch will work, and I think we still should merge it to make sure it behaves in the same/expected way- but this didn't fix the s7 imp" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[10:30:41] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:50] <wikibugs>	 (PS1) Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676)
[10:31:05] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[10:31:13] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:18] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[10:32:11] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:25] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=eqiad
[10:37:08] <_joe_>	 !log remove stale druid.svc.eqiad.wmnet certificate from the puppetmaster CA; it was expired anyways
[10:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:50] <hnowlan>	 !log starting cassandra on reimaged sessionstore1002
[10:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync
[10:40:54] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync
[10:41:00] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1002.eqiad.wmnet with OS buster
[10:41:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad
[10:48:48] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1003.eqiad.wmnet with reason: Prep for reimage
[10:49:01] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[10:49:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1003.eqiad.wmnet with reason: Prep for reimage
[10:52:56] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1003.eqiad.wmnet with OS buster
[11:02:49] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/836790 (owner: Muehlenhoff)
[11:04:50] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage
[11:05:01] <wikibugs>	 (CR) Jbond: [C: +1] dns: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:05:11] <wikibugs>	 (CR) Jbond: [C: +1] tlsproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:05:24] <wikibugs>	 (CR) Jbond: [C: +1] alerts: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:05:39] <wikibugs>	 (CR) Jbond: [C: +1] mirrors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:06:54] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (jbond)
[11:08:12] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage
[11:09:25] <wikibugs>	 (PS2) Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676)
[11:20:20] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=eqiad
[11:27:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync
[11:27:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1003.eqiad.wmnet with OS buster
[11:28:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync
[11:28:29] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad
[11:29:25] <wikibugs>	 (PS3) Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676)
[11:31:32] <wikibugs>	 SRE, Thumbor, serviceops, Patch-For-Review, User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (jijiki) Open→Invalid Thumbor is being migrated to k8s, making this task invalid :)
[11:31:40] <wikibugs>	 SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (jijiki)
[11:49:16] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/837630 (owner: Arturo Borrero Gonzalez)
[11:50:14] <wikibugs>	 (CR) Jbond: [C: +1] grub: Update includes [puppet] - https://gerrit.wikimedia.org/r/836855 (owner: Muehlenhoff)
[11:51:08] <wikibugs>	 (CR) Jbond: [C: +1] Also apply labweb->cloudweb rename for the Cumin alias [puppet] - https://gerrit.wikimedia.org/r/836795 (owner: Muehlenhoff)
[11:51:57] <wikibugs>	 (CR) Jbond: [C: +1] bgpalerter: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837070 (owner: Muehlenhoff)
[11:54:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1117.eqiad.wmnet with reason: Reboot
[11:54:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1117.eqiad.wmnet with reason: Reboot
[11:54:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35302 and previous config saved to /var/cache/conftool/dbconfig/20221003-115449-root.json
[11:59:45] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:59:49] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:59:55] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:00:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: Reboot
[12:00:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: Reboot
[12:01:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Cloning
[12:01:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Cloning
[12:01:57] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:01:59] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:02:00] <wikibugs>	 (PS2) Hashar: Allow SRE to send annotated and signed tags [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/836711
[12:02:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:02:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2123', diff saved to https://phabricator.wikimedia.org/P35303 and previous config saved to /var/cache/conftool/dbconfig/20221003-120208-root.json
[12:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:04:50] <wikibugs>	 (CR) Hashar: Allow SRE to send annotated and signed tags (1 comment) [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/836711 (owner: Hashar)
[12:09:44] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove semi_sync plugin [puppet] - https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914) (owner: Marostegui)
[12:09:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35305 and previous config saved to /var/cache/conftool/dbconfig/20221003-120954-root.json
[12:14:29] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Set binlog format for dbstore mariadb databases to ROW (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[12:15:47] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: cleanup unused Debian Buster code [puppet] - https://gerrit.wikimedia.org/r/837656
[12:17:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "PCC NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1001/37413/" [puppet] - https://gerrit.wikimedia.org/r/837656 (owner: Arturo Borrero Gonzalez)
[12:24:40] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/837656 (owner: Arturo Borrero Gonzalez)
[12:25:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35306 and previous config saved to /var/cache/conftool/dbconfig/20221003-122459-root.json
[12:36:54] <wikibugs>	 (CR) Vgutierrez: "text tests:" [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: Vgutierrez)
[12:40:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35307 and previous config saved to /var/cache/conftool/dbconfig/20221003-124004-root.json
[12:40:15] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:43:28] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: cleanup unused Debian Buster code [puppet] - https://gerrit.wikimedia.org/r/837656 (owner: Arturo Borrero Gonzalez)
[12:45:07] <wikibugs>	 (CR) Hashar: [C: +2] Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334) (owner: Hashar)
[12:48:51] <wikibugs>	 (PS2) Andrew Bogott: Dumps: switch to using clouddumps hosts rather than the old labstores. [puppet] - https://gerrit.wikimedia.org/r/835192 (https://phabricator.wikimedia.org/T309346)
[12:49:37] <wikibugs>	 (Abandoned) Majavah: depool eqsin [dns] - https://gerrit.wikimedia.org/r/815750 (owner: Majavah)
[12:51:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Dumps: switch to using clouddumps hosts rather than the old labstores. [puppet] - https://gerrit.wikimedia.org/r/835192 (https://phabricator.wikimedia.org/T309346) (owner: Andrew Bogott)
[12:52:59] <wikibugs>	 (Merged) jenkins-bot: Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334) (owner: Hashar)
[12:55:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35308 and previous config saved to /var/cache/conftool/dbconfig/20221003-125509-root.json
[12:59:44] <wikibugs>	 (PS4) Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:01:58] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: Vgutierrez)
[13:04:05] <wikibugs>	 (CR) Majavah: [C: -1] prometheus: Add new scrape target (2 comments) [puppet] - https://gerrit.wikimedia.org/r/836310 (owner: Raymond Ndibe)
[13:05:16] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Enforce RFC 9112 request-target definition [puppet] - https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: Vgutierrez)
[13:09:03] <icinga-wm>	 PROBLEM - Host db1189.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:10:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35310 and previous config saved to /var/cache/conftool/dbconfig/20221003-131014-root.json
[13:12:17] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Replaced Failed Dimm. Thanks @Marostegui
[13:12:27] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Open→Resolved
[13:14:42] <wikibugs>	 (CR) DCausse: [C: -1] Update elasticsearch memory pressure alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/837180 (owner: Ebernhardson)
[13:15:23] <icinga-wm>	 RECOVERY - Host db1189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[13:18:00] <vgutierrez>	 !log enforcing origin-form|asterisk-form for request-target on varnish (could trigger spikes of HTTP 400 errors) - T318676
[13:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:09] <stashbot>	 T318676: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676
[13:18:59] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:22:35] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Thanks John - I will take it from here and ping you if we have more issues!
[13:25:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35311 and previous config saved to /var/cache/conftool/dbconfig/20221003-132519-root.json
[13:25:29] <sukhe>	 just as a heads-up: vgutierrez and I will be upgrading to ATS9 on all cp hosts in codfw and ulsfo today. no impact expected and the caches should be preserved. see T309651
[13:25:30] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[13:25:34] <wikibugs>	 (PS1) Marostegui: Revert "db2157: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837003
[13:27:08] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2157: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837003 (owner: Marostegui)
[13:29:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35312 and previous config saved to /var/cache/conftool/dbconfig/20221003-132902-root.json
[13:30:12] <wikibugs>	 (PS10) Hashar: gerrit: decouple scap and daemon users [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412)
[13:31:26] <wikibugs>	 (PS1) Ssingh: hiera: upgrade cp hosts in codfw to ATS9 [puppet] - https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651)
[13:31:35] <wikibugs>	 (CR) Hashar: "Rebased due to "conflict" with I74744310538d780cff88e24b646675ad33630eb9" [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[13:31:38] <wikibugs>	 SRE, Parsoid, serviceops, User-brennen, Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (jijiki) @ssastry please let us know if there is anything more to be done in this task, if not, we can resolve it
[13:31:43] <wikibugs>	 (PS6) Hashar: gerrit: change deployment user on devtools [puppet] - https://gerrit.wikimedia.org/r/832507
[13:31:50] <wikibugs>	 (PS4) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379
[13:31:56] <wikibugs>	 (PS4) Hashar: gerrit: use daemon_user variable everywhere [puppet] - https://gerrit.wikimedia.org/r/833385
[13:32:47] <wikibugs>	 (PS2) Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[13:32:49] <wikibugs>	 (PS1) Giuseppe Lavagetto: termbox: use the new mesh functions [deployment-charts] - https://gerrit.wikimedia.org/r/837672
[13:34:21] <wikibugs>	 (CR) CI reject: [V: -1] Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto)
[13:34:23] <wikibugs>	 (CR) CI reject: [V: -1] termbox: use the new mesh functions [deployment-charts] - https://gerrit.wikimedia.org/r/837672 (owner: Giuseppe Lavagetto)
[13:34:38] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (Jclark-ctr) @wiki_willy  This server is out of warranty.  We do not have any spare 1.9tb SSD.   Largest i have is 1.6tb.
[13:34:45] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (Jclark-ctr) a:Jclark-ctr
[13:37:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "LGTM but keep my comment in mind." [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[13:37:31] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37414/console" [puppet] - https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:38:29] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:38:40] <wikibugs>	 (PS1) Ssingh: hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651)
[13:39:55] <wikibugs>	 (PS2) Ssingh: hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651)
[13:40:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35313 and previous config saved to /var/cache/conftool/dbconfig/20221003-134024-root.json
[13:41:00] <wikibugs>	 (CR) Clément Goubert: [V: +1] parsoid: Cleanup post php7.4 migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[13:41:29] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37416/console" [puppet] - https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:42:33] <wikibugs>	 (CR) Ssingh: [V: +1] "NOOP on 4032 as it's already running ATS9 (additional confirmation)." [puppet] - https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:44:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35314 and previous config saved to /var/cache/conftool/dbconfig/20221003-134407-root.json
[13:51:47] <wikibugs>	 SRE, serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (jijiki) Open→Resolved a:jijiki Please reopen if needed
[13:57:50] <sukhe>	 !log reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm2_amd64.changes: T309651
[13:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:54] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[13:58:50] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Andrew) @jclark, we're not using storage on this system so there's no need to replace the drive or worry about it. I've already rebuilt the raid to exclude the broken drive.  What...
[13:59:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35315 and previous config saved to /var/cache/conftool/dbconfig/20221003-135912-root.json
[13:59:58] <wikibugs>	 10SRE, 10serviceops: Increase of varnish-be failed fetches error due to "http format error" - https://phabricator.wikimedia.org/T235254 (10jijiki) 05Open→03Resolved a:03jijiki no activity, closing for now
[14:00:46] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[14:06:20] <wikibugs>	 (03PS3) 10Andrew Bogott: Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346)
[14:07:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook)
[14:07:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott)
[14:08:30] <sukhe>	 !log upgrade cp4026, cp4032 to ATS 9.1.3-1wm2 from 9.1.3-1wm1: T309651
[14:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:34] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[14:10:01] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10Papaul)  Create Dispatch: Success You have successfully submitted request SR153002644.
[14:10:10] <wikibugs>	 (03PS4) 10Andrew Bogott: Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346)
[14:10:12] <wikibugs>	 (03PS1) 10Andrew Bogott: Dumps: remove ensure->absent clause [puppet] - 10https://gerrit.wikimedia.org/r/837677
[14:12:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott)
[14:13:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 (owner: 10Arturo Borrero Gonzalez)
[14:14:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35316 and previous config saved to /var/cache/conftool/dbconfig/20221003-141417-root.json
[14:20:46] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[14:23:19] <icinga-wm>	 PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:26:37] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::canary: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/835506 (https://phabricator.wikimedia.org/T318894)
[14:26:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::canary: cleanup php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/837681 (https://phabricator.wikimedia.org/T318894)
[14:28:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) This is now done. I'm going to gradually dismantle the old dumps servers but will probably leave their data intact f...
[14:28:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[14:28:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew)
[14:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:28:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10Andrew) 05Open→03Resolved
[14:29:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35317 and previous config saved to /var/cache/conftool/dbconfig/20221003-142923-root.json
[14:30:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[14:31:03] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[14:31:42] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284)
[14:31:47] <papaul>	 !log ongoing maintenance on mr1-esams
[14:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:35:21] <sukhe>	 !log upgrade A:cp and A:drmrs to ATS 9.1.3-1wm2 from 9.1.3-1wm1: T309651
[14:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:25] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[14:36:37] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi)
[14:44:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35318 and previous config saved to /var/cache/conftool/dbconfig/20221003-144428-root.json
[14:48:45] <icinga-wm>	 PROBLEM - Host asw2-esams is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:33] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:53:36] <wikibugs>	 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez)
[14:53:43] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:53:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:39] <icinga-wm>	 PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:29] <wikibugs>	 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) T317660 has been fixed by the shipping of trafficserver 9.1.3-1wm2 including https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/834045
[14:58:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:58:58] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in codfw to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:59:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35319 and previous config saved to /var/cache/conftool/dbconfig/20221003-145933-root.json
[15:01:32] <icinga-wm>	 RECOVERY - Host asw2-esams is UP: PING OK - Packet loss = 0%, RTA = 81.58 ms
[15:01:40] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:01:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:02:05] <wikibugs>	 (03PS2) 10Ebernhardson: Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180
[15:02:07] <wikibugs>	 (03CR) 10Ebernhardson: Update elasticsearch memory pressure alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson)
[15:02:35] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) Synced on IRC, we're aiming at Thursday 1pm UTC.
[15:02:42] <wikibugs>	 (03PS10) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389)
[15:03:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr Server was in boot loop.   Pulled Add on 10g network card server completed pos...
[15:03:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:10] <icinga-wm>	 RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.58 ms
[15:06:40] <papaul>	 !log maintenance complete on mr1-esams
[15:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul)
[15:13:06] <wikibugs>	 10SRE, 10Parsoid, 10serviceops, 10User-brennen, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) Actually, we want to keep some task around to do another sprint on tackling more of our memory usage related issues at some point. Do you p...
[15:14:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35320 and previous config saved to /var/cache/conftool/dbconfig/20221003-151438-root.json
[15:15:55] <sukhe>	 !log disable Puppet on cp hosts in ulsfo: rolling out T309651
[15:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:59] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[15:16:58] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:24:18] <icinga-wm>	 RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1530).
[15:36:59] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10MPhamWMF)
[15:57:18] <wikibugs>	 (03PS6) 10DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331)
[16:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:05:38] <icinga-wm>	 PROBLEM - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10Ottomata) See https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations and https://meta.wikimedia.org/wiki/Research:FAQ#collaborations
[16:07:54] <icinga-wm>	 RECOVERY - Check systemd state on cp4036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:08:18] <wikibugs>	 (03PS1) 10Urbanecm: throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 (https://phabricator.wikimedia.org/T319212)
[16:09:02] <wikibugs>	 (03PS1) 10DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328)
[16:12:02] <wikibugs>	 (03PS1) 10Urbanecm: throttle: Remove out of date rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696
[16:13:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 (https://phabricator.wikimedia.org/T319212) (owner: 10Urbanecm)
[16:13:55] <wikibugs>	 (03Merged) 10jenkins-bot: throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 (https://phabricator.wikimedia.org/T319212) (owner: 10Urbanecm)
[16:14:54] <sukhe>	 !log disable Puppet on cp hosts in codfw: rolling out T309651
[16:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:58] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[16:16:20] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in codfw to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[16:16:36] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:58] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cae49b85d2d780e34b553789d56d76bac4a62c48: throttle: Add throttle rule for 2022-10-06 (T319212) (duration: 04m 21s)
[16:19:02] <stashbot>	 T319212: Request a throttle lift for Czech senior citizens course - 2022-10-06 - https://phabricator.wikimedia.org/T319212
[16:19:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696 (owner: 10Urbanecm)
[16:19:31] * urbanecm tries the new scap backport command
[16:19:37] <dancy>	 yay!
[16:20:08] <wikibugs>	 (03Merged) 10jenkins-bot: throttle: Remove out of date rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696 (owner: 10Urbanecm)
[16:20:29] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:837696|throttle: Remove out of date rules]]
[16:20:49] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:837696|throttle: Remove out of date rules]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[16:20:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:21:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:21:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:22:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:24:45] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837696|throttle: Remove out of date rules]] (duration: 04m 16s)
[16:25:26] <urbanecm>	 and looks like it's all done.
[16:26:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10dcaro) The server is good thanks!  It's syncing with the other, but I think this task can be closed 👍
[16:27:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:28:34] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson)
[16:28:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:28:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:29:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:31:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson)
[16:33:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 30781
[16:33:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30781
[16:34:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:37:04] <icinga-wm>	 PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:39:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:39:24] <icinga-wm>	 RECOVERY - Check systemd state on cp2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:41:43] <wikibugs>	 (03PS7) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132
[16:43:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:46:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott)
[17:00:04] <jouncebot>	 ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1700).
[17:01:03] <wikibugs>	 (03PS1) 10BBlack: dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704
[17:01:35] <wikibugs>	 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH)
[17:02:19] <wikibugs>	 (03PS2) 10BBlack: dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704 (https://phabricator.wikimedia.org/T319215)
[17:03:22] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704 (https://phabricator.wikimedia.org/T319215) (owner: 10BBlack)
[17:04:12] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10Jclark-ctr) 05Open→03Resolved
[17:04:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) We are running ATS9 on all cp hosts in: codfw, ulsfo, drmrs, in addition to the existing hosts in eqiad, esams, eqsin, the site-wide deployment of which will...
[17:04:46] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:53] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns4001.wikimedia.org
[17:07:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew)
[17:08:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:09:58] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:14] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:11:03] <bblack>	 the anycast reports there are due to dns4001 being decommed, expected
[17:11:04] <sukhe>	 hmm
[17:11:08] <sukhe>	 oh right
[17:11:09] <sukhe>	 cool
[17:11:38] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:11:44] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:11:51] <bblack>	 same basic issue there
[17:11:58] <wikibugs>	 (03PS2) 10Andrew Bogott: Dumps: remove ensure->absent clause [puppet] - 10https://gerrit.wikimedia.org/r/837677
[17:12:00] <wikibugs>	 (03PS1) 10Andrew Bogott: Move labstore100[67] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/837726 (https://phabricator.wikimedia.org/T319217)
[17:12:50] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:12:50] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:12:50] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:12:50] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:13:03] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:13:04] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns4001.wikimedia.org
[17:13:08] <wikibugs>	 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `dns4001.wikimedia.org` - dns4001.wikimedia.org (**PASS**)   - Downtimed host o...
[17:13:32] <sukhe>	 bblack: since it is decommissioned, I guess we should remove it from homer too?
[17:13:37] <sukhe>	   anycast_neighbors:
[17:13:37] <sukhe>	     dns4001: {4: 198.35.26.7}
[17:13:48] <sukhe>	 happy to patch that
[17:13:52] <bblack>	 yeah, please do!
[17:13:59] <sukhe>	 onit
[17:15:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move labstore100[67] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/837726 (https://phabricator.wikimedia.org/T319217) (owner: 10Andrew Bogott)
[17:15:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:16:16] <wikibugs>	 (03PS8) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132
[17:16:23] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215)
[17:19:03] <wikibugs>	 (03PS1) 10BBlack: ntp.ulsfo: move to dns4002 for now [dns] - 10https://gerrit.wikimedia.org/r/837730 (https://phabricator.wikimedia.org/T319215)
[17:19:45] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] Move kafka-logging1001 to PKI settings for TLS [puppet] - 10https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[17:21:49] <wikibugs>	 (03PS1) 10Matthias Mullie: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837731 (https://phabricator.wikimedia.org/T306883)
[17:22:58] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh)
[17:22:58] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:23:07] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] ntp.ulsfo: move to dns4002 for now [dns] - 10https://gerrit.wikimedia.org/r/837730 (https://phabricator.wikimedia.org/T319215) (owner: 10BBlack)
[17:23:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh)
[17:24:38] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 66 probes of 777 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:25:25] <wikibugs>	 (03Merged) 10jenkins-bot: sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh)
[17:29:03] <sukhe>	 !log running homer "cr*-ulsfo*" commit "Gerrit 837727: remove dns4001 for anycast neighbors."
[17:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:32] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[17:32:20] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:33:26] <wikibugs>	 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH)
[17:33:31] <wikibugs>	 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH) a:05RobH→03BBlack
[17:34:38] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 82, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:37:48] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 777 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:37:53] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:40:10] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:41:42] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns4003
[17:41:58] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:41:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns4003
[17:42:41] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[17:43:06] <wikibugs>	 10SRE, 10ops-eqsin, 10Traffic, 10decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (10wiki_willy) a:05wiki_willy→03RobH
[17:44:16] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:52:10] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:00:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:04:12] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:04:26] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:06:03] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall) In progress→Resolved
[18:06:11] <wikibugs>	 Puppet, SRE, Cloud-Services, Infrastructure-Foundations, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (BCornwall)
[18:06:12] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:12:47] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:19:00] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:19:08] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:56] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:30:05] <wikibugs>	 (PS1) RobH: adding new dns4003 [puppet] - https://gerrit.wikimedia.org/r/837737 (https://phabricator.wikimedia.org/T317247)
[18:30:24] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4045.ulsfo.wmnet with OS buster
[18:30:53] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:30:57] <wikibugs>	 (CR) RobH: [C: +2] adding new dns4003 [puppet] - https://gerrit.wikimedia.org/r/837737 (https://phabricator.wikimedia.org/T317247) (owner: RobH)
[18:34:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye
[18:35:06] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye
[18:35:34] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) Further updates on this thread:  1. The installation attempts and debugging above were on **bullseye**, but our cp puppetization is actually still...
[18:41:58] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye
[18:42:06] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors:...
[18:45:57] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) I see our buster actually has `linux-image-5.10.0-0.deb10.17-amd64` available in its repos.  It may just be a matter of figuring out how to launch...
[18:48:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye
[18:49:05] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye
[18:51:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:56:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:57:59] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[18:58:56] <wikibugs>	 (CR) CDanis: [C: +1] "lgtm as long as the numeric uid isn't changing" [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[19:03:20] <icinga-wm>	 RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms
[19:04:00] <icinga-wm>	 RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.32 ms
[19:09:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[19:09:13] <gehel>	 ryankemper: ^
[19:09:33] <ryankemper>	 gehel: looking
[19:10:30] <wikibugs>	 (PS1) Zabe: vcl: stop overriding cache-control header for bad title errors [puppet] - https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932)
[19:12:39] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) I've also found some other breadcrumbs.  Runtime buster + 5.10 support is puppetized in `modules/profile/manifests/base/linux510.pp`.  There's ins...
[19:15:46] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye
[19:15:50] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors: - dns4003 (**FAIL**)...
[19:20:48] <wikibugs>	 ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (RobH) p:Triage→Medium
[19:22:17] <ryankemper>	 !log [Elastic] Banned `elastic1066` (`curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic1066-production-search-psi-eqiad"}}}'`); will restart elasticsearch-psi after shards drain
[19:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[19:24:06] <icinga-wm>	 PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:24:06] <icinga-wm>	 PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:24:18] <icinga-wm>	 PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:11] <robh>	 crap, that's me
[19:25:16] <robh>	 i forgot to hit enter on log
[19:25:32] <robh>	 !log msw1-ulsfo swap, some mgmt flapping expected, swap complete but not powered back up yet
[19:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:36] <icinga-wm>	 PROBLEM - Host cr3-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:48] <icinga-wm>	 PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:48] <icinga-wm>	 PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:54] <icinga-wm>	 PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:25:54] <icinga-wm>	 PROBLEM - Host cp4033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:32] <icinga-wm>	 PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:27:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:27:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:27:56] <icinga-wm>	 PROBLEM - Host ganeti4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:56] <icinga-wm>	 PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:34] <robh>	 ok, they should start coming back
[19:30:57] <icinga-wm>	 RECOVERY - Host cp4033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[19:32:05] <robh>	 !log msw1-ulsfo swap successful, mgmt recovering in icinga and tested connection with 3 servers all work
[19:32:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:32:59] <icinga-wm>	 RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms
[19:32:59] <icinga-wm>	 RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.51 ms
[19:34:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:35:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.494 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:35:49] <icinga-wm>	 RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.32 ms
[19:35:49] <icinga-wm>	 RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.23 ms
[19:36:05] <icinga-wm>	 RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.84 ms
[19:36:11] <icinga-wm>	 RECOVERY - Host cr3-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.17 ms
[19:36:13] <icinga-wm>	 RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.30 ms
[19:36:13] <icinga-wm>	 RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms
[19:36:21] <icinga-wm>	 RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.28 ms
[19:37:04] <ryankemper>	 !log [Elastic] Restarted psi on `elastic1066`; will unban host after process is up and running
[19:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:49] <icinga-wm>	 PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:41:25] <ryankemper>	 !log [Elastic] Unbanned `elastic1066`
[19:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
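The ban and unban steps logged above both go through the cluster settings endpoint quoted in the 19:22:17 entry. Below is a minimal sketch of how those request bodies could be built. The helper names `exclusion_body` and `curl_command` are hypothetical, and the unban form (sending an empty `_name`) is an assumption inferred from the empty `_host` value in the logged command, not something the log itself shows.

```python
import json

# Endpoint copied from the logged curl command; other clusters may use another port.
SETTINGS_URL = "http://localhost:9600/_cluster/settings"


def exclusion_body(node_name: str = "") -> str:
    """Build the transient allocation-exclusion body.

    An empty node_name is assumed to clear the exclusion (unban).
    """
    return json.dumps(
        {
            "transient": {
                "cluster.routing.allocation.exclude": {
                    "_host": "",
                    "_name": node_name,
                }
            }
        }
    )


def curl_command(node_name: str = "") -> str:
    """Render the equivalent curl invocation as text (hypothetical helper)."""
    return (
        "curl -H 'Content-Type: application/json' "
        f"-XPUT {SETTINGS_URL} -d '{exclusion_body(node_name)}'"
    )


if __name__ == "__main__":
    # Ban: shards drain away from the named node.
    print(curl_command("elastic1066-production-search-psi-eqiad"))
    # Unban (assumed form): clear the exclusion once the node is healthy again.
    print(curl_command())
```

The sketch only prints the commands and sends nothing, so the output can be reviewed before anything touches a live cluster.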
[19:41:37] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:58:37] <icinga-wm>	 RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T2000).
[20:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:23] <TheresNoTime>	 Indeed, nothing in the queue
[20:03:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:21:27] <wikibugs>	 SRE, MediaWiki-Uploading, MW-1.37-notes, MW-1.38-notes, and 4 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (Krinkle) Open→Resolved
[20:28:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:41:57] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:47:29] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (Volans) The bits for the reimage cookbooks are trivial to do, Spicerack has already support for custom images, see the `media_type` argument to https://do...
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T2100).
[21:07:37] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:47] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye
[21:18:51] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye
[21:18:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:27:25] <wikibugs>	 (PS5) SBassett: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[21:27:58] <wikibugs>	 (CR) CI reject: [V: -1] Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[21:33:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:42:09] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) dns4003 is getting stuck in the reimage at:   ` 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'. 100.0% (1/1) succes...
[21:44:00] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye
[21:44:03] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors: - dns4003 (**FAIL**)...
[21:44:31] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:45:44] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:51:20] <wikibugs>	 (PS1) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751
[21:54:59] <wikibugs>	 (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott)
[22:18:56] <wikibugs>	 (CR) Ladsgroup: [C: +1] admin: Revoke my ssh key temporarily [puppet] - https://gerrit.wikimedia.org/r/837079 (owner: Ladsgroup)
[22:21:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:26:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:51:57] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:31:01] <wikibugs>	 (PS1) Stang: throttle: Add throttle rule for 2022-10-13 [mediawiki-config] - https://gerrit.wikimedia.org/r/837756 (https://phabricator.wikimedia.org/T319244)
[23:36:56] <wikibugs>	 (PS1) Stang: ukwiki: Create flood group [mediawiki-config] - https://gerrit.wikimedia.org/r/837757 (https://phabricator.wikimedia.org/T319243)
[23:53:17] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:11] <wikibugs>	 SRE, Traffic-Icebox, Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710 (Krinkle) Open→Declined I'm declining this as I no longer believe this is an important need for the original objective. I think Varnish is sufficiently standalone and s...