[00:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:22:43] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:43:16] !issync
[00:43:17] Syncing #wikimedia-operations (requested by legoktm)
[00:43:18] Set /cs flags #wikimedia-operations Majavah +Aiotv
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:09] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:29:25] (PS1) Marostegui: db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837465
[05:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P35244 and previous config saved to /var/cache/conftool/dbconfig/20221003-052927-root.json
[05:32:13] (CR) Marostegui: [C: +2] db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837465 (owner: Marostegui)
[05:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35245 and previous config saved to /var/cache/conftool/dbconfig/20221003-054206-root.json
[05:42:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1167', diff saved to https://phabricator.wikimedia.org/P35246 and previous config saved to /var/cache/conftool/dbconfig/20221003-054245-root.json
[05:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35247 and previous config saved to /var/cache/conftool/dbconfig/20221003-055052-root.json
[05:51:08] (PS1) Marostegui: Revert "db1179: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837001
[05:53:32] (CR) Marostegui: [C: +2] Revert "db1179: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/837001 (owner: Marostegui)
[05:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P35248 and previous config saved to /var/cache/conftool/dbconfig/20221003-055401-root.json
[05:57:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35249 and previous config saved to /var/cache/conftool/dbconfig/20221003-055711-root.json
[06:04:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15133
[06:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35250 and previous config saved to /var/cache/conftool/dbconfig/20221003-060557-root.json
[06:07:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15133
[06:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35251 and previous config saved to /var/cache/conftool/dbconfig/20221003-061216-root.json
[06:13:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3300
[06:15:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3300
[06:20:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35252 and previous config saved to /var/cache/conftool/dbconfig/20221003-062022-root.json
[06:20:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35253 and previous config saved to /var/cache/conftool/dbconfig/20221003-062102-root.json
[06:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:26:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5400
[06:27:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5400
[06:27:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35254 and previous config saved to /var/cache/conftool/dbconfig/20221003-062721-root.json
[06:30:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11039
[06:30:30] (PS1) Marostegui: mariadb: Remove innodb_large_prefix flag. [puppet] - https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879)
[06:30:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11039
[06:33:08] (PS2) Marostegui: mariadb: Remove innodb_large_prefix flag. [puppet] - https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879)
[06:35:01] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:35:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35255 and previous config saved to /var/cache/conftool/dbconfig/20221003-063527-root.json
[06:36:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35256 and previous config saved to /var/cache/conftool/dbconfig/20221003-063607-root.json
[06:42:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35257 and previous config saved to /var/cache/conftool/dbconfig/20221003-064226-root.json
[06:46:03] (PS1) Marostegui: db1182: Upgrade from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837491 (https://phabricator.wikimedia.org/T301879)
[06:46:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P35258 and previous config saved to /var/cache/conftool/dbconfig/20221003-064638-root.json
[06:47:13] (CR) Marostegui: [C: +2] db1182: Upgrade from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837491 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[06:48:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 6128
[06:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35259 and previous config saved to /var/cache/conftool/dbconfig/20221003-065031-root.json
[06:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35260 and previous config saved to /var/cache/conftool/dbconfig/20221003-065112-root.json
[06:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35261 and previous config saved to /var/cache/conftool/dbconfig/20221003-065154-root.json
[06:52:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6128
[06:56:06] (PS1) Marostegui: mariadb: Remove semi_sync plugin [puppet] - https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914)
[06:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35262 and previous config saved to /var/cache/conftool/dbconfig/20221003-065731-root.json
[06:58:17] (PS1) Marostegui: db2175: Upgrade mariadb to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837493 (https://phabricator.wikimedia.org/T318914)
[06:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2175', diff saved to https://phabricator.wikimedia.org/P35263 and previous config saved to /var/cache/conftool/dbconfig/20221003-065844-root.json
[07:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:13] (CR) Marostegui: [C: +2] db2175: Upgrade mariadb to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837493 (https://phabricator.wikimedia.org/T318914) (owner: Marostegui)
[07:01:15] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:03:29] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35264 and previous config saved to /var/cache/conftool/dbconfig/20221003-070431-root.json
[07:05:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35265 and previous config saved to /var/cache/conftool/dbconfig/20221003-070536-root.json
[07:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35266 and previous config saved to /var/cache/conftool/dbconfig/20221003-070617-root.json
[07:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35267 and previous config saved to /var/cache/conftool/dbconfig/20221003-070659-root.json
[07:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35268 and previous config saved to /var/cache/conftool/dbconfig/20221003-071236-root.json
[07:16:53] (PS1) Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[07:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35269 and previous config saved to /var/cache/conftool/dbconfig/20221003-071936-root.json
[07:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35270 and previous config saved to /var/cache/conftool/dbconfig/20221003-072041-root.json
[07:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35271 and previous config saved to /var/cache/conftool/dbconfig/20221003-072122-root.json
[07:22:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35272 and previous config saved to /var/cache/conftool/dbconfig/20221003-072204-root.json
[07:26:07] (CR) Marostegui: [C: +1] "So this didn't work even with binlog disabled?" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[07:27:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35273 and previous config saved to /var/cache/conftool/dbconfig/20221003-072741-root.json
[07:34:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35274 and previous config saved to /var/cache/conftool/dbconfig/20221003-073441-root.json
[07:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35275 and previous config saved to /var/cache/conftool/dbconfig/20221003-073546-root.json
[07:35:54] (PS1) Marostegui: db1120: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837497 (https://phabricator.wikimedia.org/T301879)
[07:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1200', diff saved to https://phabricator.wikimedia.org/P35276 and previous config saved to /var/cache/conftool/dbconfig/20221003-073556-root.json
[07:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35277 and previous config saved to /var/cache/conftool/dbconfig/20221003-073627-root.json
[07:36:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1200.eqiad.wmnet with reason: Upgrade to 10.6
[07:36:40] !log cr2-drmrs# set chassis fpc 0 sampling-instance pmacct
[07:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1200.eqiad.wmnet with reason: Upgrade to 10.6
[07:36:58] SRE, observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (fgiunchedi) >>! In T318911#8272789, @ssingh wrote: > So to summarize, a short term fix can be to delete the misbehaving CT log (https://yeti2023.ct.digicert.com/log/). But a long term fix needs to include...
[07:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35278 and previous config saved to /var/cache/conftool/dbconfig/20221003-073709-root.json
[07:37:49] (CR) Marostegui: [C: +2] db1120: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837497 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:39:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35279 and previous config saved to /var/cache/conftool/dbconfig/20221003-073944-root.json
[07:42:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16637
[07:42:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16637
[07:49:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35280 and previous config saved to /var/cache/conftool/dbconfig/20221003-074946-root.json
[07:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35281 and previous config saved to /var/cache/conftool/dbconfig/20221003-075051-root.json
[07:51:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35282 and previous config saved to /var/cache/conftool/dbconfig/20221003-075214-root.json
[07:54:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35283 and previous config saved to /var/cache/conftool/dbconfig/20221003-075449-root.json
[07:56:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2178', diff saved to https://phabricator.wikimedia.org/P35284 and previous config saved to /var/cache/conftool/dbconfig/20221003-075643-root.json
[07:56:55] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:57:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[07:57:26] (PS1) Marostegui: db2178: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837614 (https://phabricator.wikimedia.org/T301879)
[07:57:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[07:58:28] (CR) Marostegui: [C: +2] db2178: Migrate from 10.4 to 10.6 [puppet] - https://gerrit.wikimedia.org/r/837614 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:59:50] (CR) David Caro: [C: +2] flake8: Several pep8/flake8 fixes [puppet] - https://gerrit.wikimedia.org/r/837126 (owner: David Caro)
[08:00:13] SRE: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (fgiunchedi)
[08:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:04:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35285 and previous config saved to /var/cache/conftool/dbconfig/20221003-080451-root.json
[08:05:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[08:05:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Upgrade to 10.6
[08:05:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16509
[08:05:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35286 and previous config saved to /var/cache/conftool/dbconfig/20221003-080556-root.json
[08:05:59] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 16509
[08:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35287 and previous config saved to /var/cache/conftool/dbconfig/20221003-080719-root.json
[08:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35288 and previous config saved to /var/cache/conftool/dbconfig/20221003-080954-root.json
[08:12:42] (PS1) Marostegui: db2178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837615
[08:13:18] (CR) Marostegui: [C: +2] db2178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837615 (owner: Marostegui)
[08:16:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 39386
[08:19:28] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (dcaro)
[08:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35289 and previous config saved to /var/cache/conftool/dbconfig/20221003-081955-root.json
[08:21:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 39386
[08:22:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35290 and previous config saved to /var/cache/conftool/dbconfig/20221003-082224-root.json
[08:23:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30781
[08:24:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30781
[08:25:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35291 and previous config saved to /var/cache/conftool/dbconfig/20221003-082459-root.json
[08:26:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12975
[08:26:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12975
[08:26:59] SRE, serviceops, Service-deployment-requests: New Service Request - Calculator Service - https://phabricator.wikimedia.org/T273807 (Joe) Open→Invalid Closing as invalid because I don't think we need this anymore
[08:28:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15557
[08:29:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15557
[08:29:27] ops-eqsin, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez)
[08:30:00] SRE, Continuous-Integration-Infrastructure, serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (Joe) p:Triage→High
[08:30:06] SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) Open→Resolved >>! In T314256#8275011, @MoritzMuehlenhoff wrote: > Traffic folks, can be please go ahead and fully decom cp5001, then? Right now this is in a weird limb...
[08:30:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp5001.eqsin.wmnet
[08:34:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12956
[08:35:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35292 and previous config saved to /var/cache/conftool/dbconfig/20221003-083502-root.json
[08:35:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12956
[08:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:18] !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox
[08:37:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: After upgrade to 10.6', diff saved to https://phabricator.wikimedia.org/P35293 and previous config saved to /var/cache/conftool/dbconfig/20221003-083729-root.json
[08:38:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3303
[08:39:40] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 3303
[08:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35294 and previous config saved to /var/cache/conftool/dbconfig/20221003-084004-root.json
[08:40:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:40:06] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp5001.eqsin.wmnet
[08:40:09] ops-eqsin, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp5001.eqsin.wmnet` - cp5001.eqsin.wmnet (**FAIL**) - //Host not found on Icinga, unable to do...
[08:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:41:15] (CR) David Caro: alerts.downtime_host: attempt to match alert hostnames with : (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[08:48:04] SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (dcaro)
[08:48:10] SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (dcaro)
[08:48:54] SRE, Pontoon: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (Aklapper)
[08:50:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35295 and previous config saved to /var/cache/conftool/dbconfig/20221003-085007-root.json
[08:51:40] SRE, ops-eqsin, Traffic, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez) a:wiki_willy
[08:52:04] SRE, ops-eqsin, Traffic, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (Vgutierrez)
[08:53:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 12975
[08:54:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12975
[08:55:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35296 and previous config saved to /var/cache/conftool/dbconfig/20221003-085509-root.json
[08:58:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2157', diff saved to https://phabricator.wikimedia.org/P35297 and previous config saved to /var/cache/conftool/dbconfig/20221003-085840-root.json
[08:59:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db[2157,2178].codfw.wmnet with reason: Reclone
[08:59:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db[2157,2178].codfw.wmnet with reason: Reclone
[09:01:37] (PS1) Marostegui: db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837619 (https://phabricator.wikimedia.org/T319169)
[09:02:17] (CR) Marostegui: [C: +2] db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/837619 (https://phabricator.wikimedia.org/T319169) (owner: Marostegui)
[09:02:34] (PS4) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:10:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35299 and previous config saved to /var/cache/conftool/dbconfig/20221003-091014-root.json
[09:11:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62044
[09:11:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62044
[09:14:21] (CR) Elukey: Update calico to v3.23.3 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:16:29] (PS5) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:19:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:23] (CR) Elukey: [C: +1] "Left some nits, you can freely skip them if they are not worth it. The changes look good, even if I don't have a lot of context in what ch" [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:20:12] (CR) Elukey: [C: +1] "LGTM (modulo the calico-specific changes, I didn't check all of them)." [deployment-charts] - https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:21:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31133
[09:22:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31133
[09:24:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35300 and previous config saved to /var/cache/conftool/dbconfig/20221003-092519-root.json
[09:28:54] (PS6) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[09:30:24] (CR) Vgutierrez: [C: +2] "looking good on https://grafana.wikimedia.org/dashboard/snapshot/0gMtUk3zjPMMopv9BCKSUxykeXHb1zXm?orgId=1" [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[09:30:57] (CR) Vgutierrez: [V: +2 C: +2] Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[09:31:02] (PS1) Elukey: Move kafka-logging2001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130)
[09:32:48] SRE, Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez) Open→Resolved SLO dashboard available in https://grafana.wikimedia.org/d/slo-trafficserver-tmpl/trafficserver-slos-grizzly-template?orgId=1
[09:33:08] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37403/console" [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[09:33:22] (PS2) Elukey: Move kafka-logging1001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130)
[09:33:30] (CR) Elukey: Move kafka-logging1001 to PKI settings for TLS [puppet] - https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[09:34:33] (PS3) Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405)
[09:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:45:18]
(03CR) 10Vgutierrez: [C: 03+2] varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - 10https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [09:47:56] (03CR) 10Vgutierrez: [C: 03+2] mtail:varnishsli: Track client sided requests only [puppet] - 10https://gerrit.wikimedia.org/r/834525 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [09:59:08] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:00:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1002.eqiad.wmnet with reason: Prep for reimage [10:00:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1002.eqiad.wmnet with reason: Prep for reimage [10:00:56] !log c-foreach-nt drain on sessionstore1002 [10:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:02] (03PS2) 10AOkoth: vrts: enable vrts-daemon on WMCS instance [puppet] - 10https://gerrit.wikimedia.org/r/834510 (https://phabricator.wikimedia.org/T317059) [10:05:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1002.eqiad.wmnet with OS buster [10:07:43] (03PS1) 10Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 [10:07:45] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) [10:08:30] (03PS2) 10Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 [10:08:32] (03PS2) 10Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) [10:10:50] (03CR) 
10CI reject: [V: 04-1] interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 (owner: 10Arturo Borrero Gonzalez) [10:12:33] (03PS3) 10Arturo Borrero Gonzalez: interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 [10:12:35] (03PS3) 10Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) [10:16:42] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage [10:19:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage [10:23:09] (03CR) 10Jcrespo: [C: 03+1] "This is ok to me, but probably btullis should ok it too." [puppet] - 10https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [10:24:38] (03CR) 10Jcrespo: [C: 03+1] mariadb: Remove semi_sync plugin [puppet] - 10https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914) (owner: 10Marostegui) [10:25:04] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/37406/" [puppet] - 10https://gerrit.wikimedia.org/r/837630 (owner: 10Arturo Borrero Gonzalez) [10:26:56] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/37407/" [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [10:27:01] (03CR) 10Jcrespo: "The patch will work, and I think we still should merge it to make sure it behaves in the same/expected way- but this didn't fix the s7 imp" [puppet] - 10https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [10:30:41] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:50] (03PS1) 10Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) [10:31:05] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [10:31:13] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:18] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [10:32:11] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=eqiad [10:37:08] <_joe_> !log remove stale druid.svc.eqiad.wmnet certificate from the puppetmaster CA; it was expired anyways [10:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:50] !log starting cassandra on reimaged sessionstore1002 [10:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [10:40:54] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [10:41:00] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1002.eqiad.wmnet with OS buster [10:41:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad [10:48:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1003.eqiad.wmnet 
with reason: Prep for reimage [10:49:01] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [10:49:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1003.eqiad.wmnet with reason: Prep for reimage [10:52:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1003.eqiad.wmnet with OS buster [11:02:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/836790 (owner: 10Muehlenhoff) [11:04:50] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage [11:05:01] (03CR) 10Jbond: [C: 03+1] dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:05:11] (03CR) 10Jbond: [C: 03+1] tlsproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:05:24] (03CR) 10Jbond: [C: 03+1] alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:05:39] (03CR) 10Jbond: [C: 03+1] mirrors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:06:54] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) [11:08:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage [11:09:25] (03PS2) 10Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - 10https://gerrit.wikimedia.org/r/837633 
(https://phabricator.wikimedia.org/T318676) [11:20:20] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=eqiad [11:27:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [11:27:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1003.eqiad.wmnet with OS buster [11:28:02] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [11:28:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad [11:29:25] (03PS3) 10Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) [11:31:32] 10SRE, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki) 05Open→03Invalid Thumbor is being migrated to k8s, making this task invalid :) [11:31:40] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10jijiki) [11:49:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/837630 (owner: 10Arturo Borrero Gonzalez) [11:50:14] (03CR) 10Jbond: [C: 03+1] grub: Update includes [puppet] - 10https://gerrit.wikimedia.org/r/836855 (owner: 10Muehlenhoff) [11:51:08] (03CR) 10Jbond: [C: 03+1] Also apply labweb->cloudweb rename for the Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836795 (owner: 10Muehlenhoff) [11:51:57] (03CR) 10Jbond: [C: 03+1] bgpalerter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837070 (owner: 10Muehlenhoff) [11:54:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1117.eqiad.wmnet with reason: Reboot [11:54:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on db1117.eqiad.wmnet with reason: Reboot [11:54:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35302 and previous config saved to /var/cache/conftool/dbconfig/20221003-115449-root.json [11:59:45] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:59:49] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:59:55] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:00:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: Reboot [12:00:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: Reboot [12:01:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Cloning [12:01:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Cloning [12:01:57] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:01:59] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:02:00] (03PS2) 10Hashar: Allow SRE to send annotated and signed tags [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 [12:02:05] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2123', diff saved to 
https://phabricator.wikimedia.org/P35303 and previous config saved to /var/cache/conftool/dbconfig/20221003-120208-root.json [12:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:04:50] (03CR) 10Hashar: Allow SRE to send annotated and signed tags (031 comment) [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/836711 (owner: 10Hashar) [12:09:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove semi_sync plugin [puppet] - 10https://gerrit.wikimedia.org/r/837492 (https://phabricator.wikimedia.org/T318914) (owner: 10Marostegui) [12:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35305 and previous config saved to /var/cache/conftool/dbconfig/20221003-120954-root.json [12:14:29] (03CR) 10Marostegui: [C: 03+1] mariadb: Set binlog format for dbstore mariadb databases to ROW (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [12:15:47] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused Debian Buster code [puppet] - 10https://gerrit.wikimedia.org/r/837656 [12:17:33] (03CR) 10Arturo Borrero Gonzalez: "PCC NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1001/37413/" [puppet] - 10https://gerrit.wikimedia.org/r/837656 (owner: 10Arturo Borrero Gonzalez) [12:24:40] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/837656 (owner: 10Arturo Borrero Gonzalez) [12:25:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35306 and previous config saved to /var/cache/conftool/dbconfig/20221003-122459-root.json [12:36:54] (03CR) 10Vgutierrez: "text 
tests:" [puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: 10Vgutierrez) [12:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35307 and previous config saved to /var/cache/conftool/dbconfig/20221003-124004-root.json [12:40:15] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:43:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: cleanup unused Debian Buster code [puppet] - 10https://gerrit.wikimedia.org/r/837656 (owner: 10Arturo Borrero Gonzalez) [12:45:07] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [12:48:51] (03PS2) 10Andrew Bogott: Dumps: switch to using clouddumps hosts rather than the old labstores. [puppet] - 10https://gerrit.wikimedia.org/r/835192 (https://phabricator.wikimedia.org/T309346) [12:49:37] (03Abandoned) 10Majavah: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815750 (owner: 10Majavah) [12:51:09] (03CR) 10Andrew Bogott: [C: 03+2] Dumps: switch to using clouddumps hosts rather than the old labstores. 
[puppet] - 10https://gerrit.wikimedia.org/r/835192 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [12:52:59] (03Merged) 10jenkins-bot: Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [12:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35308 and previous config saved to /var/cache/conftool/dbconfig/20221003-125509-root.json [12:59:44] (03PS4) 10Vgutierrez: varnish: Enforce RFC 9112 request-target definition [puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:01:58] (03CR) 10BBlack: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: 10Vgutierrez) [13:04:05] (03CR) 10Majavah: [C: 04-1] prometheus: Add new scrape target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/836310 (owner: 10Raymond Ndibe) [13:05:16] (03CR) 10Vgutierrez: [C: 03+2] varnish: Enforce RFC 9112 request-target definition [puppet] - 10https://gerrit.wikimedia.org/r/837633 (https://phabricator.wikimedia.org/T318676) (owner: 10Vgutierrez) [13:09:03] PROBLEM - Host db1189.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35310 and previous config saved to /var/cache/conftool/dbconfig/20221003-131014-root.json [13:12:17] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) Replaced Failed Dimm. Thanks @Marostegui [13:12:27] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) 05Open→03Resolved [13:14:42] (03CR) 10DCausse: [C: 04-1] Update elasticsearch memory pressure alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson) [13:15:23] RECOVERY - Host db1189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [13:18:00] !log enforcing origin-form|asterisk-form for request-target on varnish (could trigger spikes of HTTP 400 errors) - T318676 [13:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:09] T318676: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 [13:18:59] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:22:35] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Thanks John 
- I will take it from here and ping you if we have more issues! [13:25:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35311 and previous config saved to /var/cache/conftool/dbconfig/20221003-132519-root.json [13:25:29] just as a heads-up: vgutierrez and I will be upgrading to ATS9 on all cp hosts in codfw and ulsfo today. no impact expected and the caches should be preserved. see T309651 [13:25:30] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [13:25:34] (03PS1) 10Marostegui: Revert "db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/837003 [13:27:08] (03CR) 10Marostegui: [C: 03+2] Revert "db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/837003 (owner: 10Marostegui) [13:29:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35312 and previous config saved to /var/cache/conftool/dbconfig/20221003-132902-root.json [13:30:12] (03PS10) 10Hashar: gerrit: decouple scap and daemon users [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) [13:31:26] (03PS1) 10Ssingh: hiera: upgrade cp hosts in codfw to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) [13:31:35] (03CR) 10Hashar: "Rebased due to "conflict" with I74744310538d780cff88e24b646675ad33630eb9" [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [13:31:38] 10SRE, 10Parsoid, 10serviceops, 10User-brennen, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10jijiki) @ssastry please let us know if there is anything more to be done in this task, if not, we can resolve it [13:31:43] (03PS6) 10Hashar: gerrit: change deployment user on devtools 
[puppet] - 10https://gerrit.wikimedia.org/r/832507 [13:31:50] (03PS4) 10Hashar: gerrit: make homedir variable [puppet] - 10https://gerrit.wikimedia.org/r/833379 [13:31:56] (03PS4) 10Hashar: gerrit: use daemon_user variable everywhere [puppet] - 10https://gerrit.wikimedia.org/r/833385 [13:32:47] (03PS2) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [13:32:49] (03PS1) 10Giuseppe Lavagetto: termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 [13:34:21] (03CR) 10CI reject: [V: 04-1] Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [13:34:23] (03CR) 10CI reject: [V: 04-1] termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 (owner: 10Giuseppe Lavagetto) [13:34:38] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Jclark-ctr) @wiki_willy This server is out of warranty. We do not have any spare 1.9tb SSD. Largest i have is 1.6tb. [13:34:45] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Jclark-ctr) a:03Jclark-ctr [13:37:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but keep my comment in mind." 
[puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [13:37:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37414/console" [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:38:29] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:40] (03PS1) 10Ssingh: hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) [13:39:55] (03PS2) 10Ssingh: hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) [13:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35313 and previous config saved to /var/cache/conftool/dbconfig/20221003-134024-root.json [13:41:00] (03CR) 10Clément Goubert: [V: 03+1] parsoid: Cleanup post php7.4 migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [13:41:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37416/console" [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:42:33] (03CR) 10Ssingh: [V: 03+1] "NOOP on 4032 as it's already running ATS9 (additional confirmation)." 
[puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:44:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35314 and previous config saved to /var/cache/conftool/dbconfig/20221003-134407-root.json [13:51:47] 10SRE, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10jijiki) 05Open→03Resolved a:03jijiki Please reopen if needed [13:57:50] !log reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm2_amd64.changes: T309651 [13:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:54] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [13:58:50] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Andrew) @jclark, we're not using storage on this system so there's no need to replace the drive or worry about it. I've already rebuilt the raid to exclude the broken drive. What... 
[13:59:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35315 and previous config saved to /var/cache/conftool/dbconfig/20221003-135912-root.json [13:59:58] 10SRE, 10serviceops: Increase of varnish-be failed fetches error due to "http format error" - https://phabricator.wikimedia.org/T235254 (10jijiki) 05Open→03Resolved a:03jijiki no activity, closing for now [14:00:46] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:06:20] (03PS3) 10Andrew Bogott: Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) [14:07:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) [14:07:20] (03CR) 10CI reject: [V: 04-1] Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [14:08:30] !log upgrade cp4026, cp4032 to ATS 9.1.3-1wm2 from 9.1.3-1wm1: T309651 [14:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:34] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [14:10:01] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10Papaul) Create Dispatch: Success You have successfully submitted request SR153002644. 
[14:10:10] (03PS4) 10Andrew Bogott: Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) [14:10:12] (03PS1) 10Andrew Bogott: Dumps: remove ensure->absent clause [puppet] - 10https://gerrit.wikimedia.org/r/837677 [14:12:56] (03CR) 10Andrew Bogott: [C: 03+2] Dumps: stop mounting the old labstore100x servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [14:13:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] interface: factorize interface renaming function [puppet] - 10https://gerrit.wikimedia.org/r/837630 (owner: 10Arturo Borrero Gonzalez) [14:14:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35316 and previous config saved to /var/cache/conftool/dbconfig/20221003-141417-root.json [14:20:46] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:23:19] PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:26:37] (03PS2) 10Giuseppe Lavagetto: mediawiki::canary: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/835506 (https://phabricator.wikimedia.org/T318894) [14:26:39] (03PS1) 10Giuseppe Lavagetto: mediawiki::canary: cleanup php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/837681 (https://phabricator.wikimedia.org/T318894) [14:28:23] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) This is now done. 
I'm going to gradually dismantle the old dumps servers but will probably leave their data intact f... [14:28:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [14:28:53] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [14:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10Andrew) 05Open→03Resolved [14:29:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35317 and previous config saved to /var/cache/conftool/dbconfig/20221003-142923-root.json [14:30:43] (03CR) 10Andrew Bogott: [C: 03+1] cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [14:31:03] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [14:31:42] (03PS4) 10Arturo Borrero Gonzalez: cloudnet1005/1006: prepare for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) [14:31:47] !log on going maintenance on mr1-esams [14:31:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:35:21] !log upgrade A:cp and A:drmrs to ATS 9.1.3-1wm2 from 9.1.3-1wm1: T309651 [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [14:36:37] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [14:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35318 and previous config saved to /var/cache/conftool/dbconfig/20221003-144428-root.json [14:48:45] PROBLEM - Host asw2-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:53:33] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:53:36] 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) [14:53:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:53:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:39] PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:29] 10SRE, 10Traffic: CDN 
doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) T317660 has been fixed by the shipping of trafficserver 9.1.3-1wm2 including https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/834045 [14:58:27] (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:58:58] (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in codfw to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35319 and previous config saved to /var/cache/conftool/dbconfig/20221003-145933-root.json [15:01:32] RECOVERY - Host asw2-esams is UP: PING OK - Packet loss = 0%, RTA = 81.58 ms [15:01:40] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:02:05] (03PS2) 10Ebernhardson: Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 [15:02:07] (03CR) 10Ebernhardson: Update elasticsearch memory pressure alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson) [15:02:35] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) Synced on IRC, we're aiming at Thursday 1pm UTC. 
[15:02:42] (03PS10) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [15:03:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr Server was in boot loop. Pulled Add on 10g network card server completed pos... [15:03:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:10] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.58 ms [15:06:40] !log maintenance complete on mr1-esams [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:11] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) [15:13:06] 10SRE, 10Parsoid, 10serviceops, 10User-brennen, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) Actually, we want to keep some task around to do another sprint on tackling more of our memory usage related issues at some point. Do you p... 
[15:14:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35320 and previous config saved to /var/cache/conftool/dbconfig/20221003-151438-root.json [15:15:55] !log disable Puppet on cp hosts in ulsfo: rolling out T309651 [15:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:59] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [15:16:58] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in ulsfo to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837673 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:24:18] RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1530). 
[15:36:59] 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10MPhamWMF) [15:57:18] (03PS6) 10DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) [16:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:05:38] PROBLEM - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:27] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10Ottomata) See https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations and https://meta.wikimedia.org/wiki/Research:FAQ#collaborations [16:07:54] RECOVERY - Check systemd state on cp4036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:18] (03PS1) 10Urbanecm: throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 (https://phabricator.wikimedia.org/T319212) [16:09:02] (03PS1) 10DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) [16:12:02] (03PS1) 10Urbanecm: throttle: Remove out of date rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696 [16:13:10] (03CR) 10Urbanecm: [C: 03+2] throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 
(https://phabricator.wikimedia.org/T319212) (owner: 10Urbanecm) [16:13:55] (03Merged) 10jenkins-bot: throttle: Add throttle rule for 2022-10-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837694 (https://phabricator.wikimedia.org/T319212) (owner: 10Urbanecm) [16:14:54] !log disable Puppet on cp hosts in codfw: rolling out T309651 [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:58] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [16:16:20] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in codfw to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/837670 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [16:16:36] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:58] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cae49b85d2d780e34b553789d56d76bac4a62c48: throttle: Add throttle rule for 2022-10-06 (T319212) (duration: 04m 21s) [16:19:02] T319212: Request a throttle lift for Czech senior citizens course - 2022-10-06 - https://phabricator.wikimedia.org/T319212 [16:19:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696 (owner: 10Urbanecm) [16:19:31] * urbanecm tries the new scap backport command [16:19:37] yay! 
[16:20:08] (03Merged) 10jenkins-bot: throttle: Remove out of date rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837696 (owner: 10Urbanecm) [16:20:29] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:837696|throttle: Remove out of date rules]] [16:20:49] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:837696|throttle: Remove out of date rules]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [16:20:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:21:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:21:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:22:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:24:45] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837696|throttle: Remove out of date rules]] (duration: 04m 16s) [16:25:26] and looks like it's all done. [16:26:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10dcaro) The server is good thanks! 
It's syncing with the other, but I think this task can be closed 👍 [16:27:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:28:34] (03CR) 10DCausse: [C: 03+2] Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson) [16:28:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:28:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:29:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:31:02] (03Merged) 10jenkins-bot: Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 (owner: 10Ebernhardson) [16:33:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 30781 [16:33:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30781 [16:34:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:37:04] PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:39:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:39:24] RECOVERY - Check systemd state on cp2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:43] (03PS7) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [16:43:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:46:04] (03CR) 10CI reject: [V: 04-1] 
alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [17:00:04] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T1700). [17:01:03] (03PS1) 10BBlack: dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704 [17:01:35] 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH) [17:02:19] (03PS2) 10BBlack: dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704 (https://phabricator.wikimedia.org/T319215) [17:03:22] (03CR) 10BBlack: [C: 03+2] dns4001: remove from various dns/ntp config [puppet] - 10https://gerrit.wikimedia.org/r/837704 (https://phabricator.wikimedia.org/T319215) (owner: 10BBlack) [17:04:12] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10Jclark-ctr) 05Open→03Resolved [17:04:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) We are running ATS9 on all cp hosts in: codfw, ulsfo, drmrs, in addition to the existing hosts in eqiad, esams, eqsin, the site-wide deployment of which will... 
[17:04:46] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:53] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns4001.wikimedia.org [17:07:34] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [17:08:58] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:09:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:11:03] the anycast reports there are due to dns4001 being decommed, expected [17:11:04] hmm [17:11:08] oh right [17:11:09] cool [17:11:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:11:44] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:11:51] same basic issue there [17:11:58] (03PS2) 10Andrew Bogott: Dumps: remove ensure->absent clause [puppet] - 10https://gerrit.wikimedia.org/r/837677 [17:12:00] (03PS1) 10Andrew Bogott: Move labstore100[67] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/837726 (https://phabricator.wikimedia.org/T319217) [17:12:50] ACKNOWLEDGEMENT - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 
Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:50] ACKNOWLEDGEMENT - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:50] ACKNOWLEDGEMENT - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:50] ACKNOWLEDGEMENT - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast Brandon Black Triggered by dns4001 decom in T319215 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:13:03] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:04] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns4001.wikimedia.org [17:13:08] 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `dns4001.wikimedia.org` - dns4001.wikimedia.org (**PASS**) - Downtimed host o... [17:13:32] bblack: since it is decommissioned, I guess we should remove it from homer too? [17:13:37] anycast_neighbors: [17:13:37] dns4001: {4: 198.35.26.7} [17:13:48] happy to patch that [17:13:52] yeah, please do! 
[17:13:59] onit [17:15:12] (03CR) 10Andrew Bogott: [C: 03+2] Move labstore100[67] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/837726 (https://phabricator.wikimedia.org/T319217) (owner: 10Andrew Bogott) [17:15:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:16:16] (03PS8) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [17:16:23] (03PS1) 10Ssingh: sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) [17:19:03] (03PS1) 10BBlack: ntp.ulsfo: move to dns4002 for now [dns] - 10https://gerrit.wikimedia.org/r/837730 (https://phabricator.wikimedia.org/T319215) [17:19:45] (03CR) 10Cwhite: [C: 03+1] Move kafka-logging1001 to PKI settings for TLS [puppet] - 10https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [17:21:49] (03PS1) 10Matthias Mullie: Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837731 (https://phabricator.wikimedia.org/T306883) [17:22:58] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh) [17:22:58] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:07] (03CR) 10BBlack: [C: 03+2] ntp.ulsfo: move to dns4002 for now [dns] - 10https://gerrit.wikimedia.org/r/837730 (https://phabricator.wikimedia.org/T319215) (owner: 10BBlack) [17:23:34] (03CR) 10Ssingh: [C: 03+2] 
sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh) [17:24:38] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 66 probes of 777 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:25:25] (03Merged) 10jenkins-bot: sites.yaml: remove dns4001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/837727 (https://phabricator.wikimedia.org/T319215) (owner: 10Ssingh) [17:29:03] !log running homer "cr*-ulsfo*" commit "Gerrit 837727: remove dns4001 for anycast neighbors." [17:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [17:32:20] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:33:26] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH) [17:33:31] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH) a:05RobH→03BBlack [17:34:38] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 82, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:48] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 777 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:37:53] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:40:10] !log robh@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [17:41:42] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns4003 [17:41:58] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:41:58] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns4003 [17:42:41] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [17:43:06] 10SRE, 10ops-eqsin, 10Traffic, 10decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (10wiki_willy) a:05wiki_willy→03RobH [17:44:16] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:52:10] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:00:30] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:04:12] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:26] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:06:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (10BCornwall) 05In progress→03Resolved [18:06:11] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10BCornwall) [18:06:12] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host 
dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:12:47] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:19:00] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:08] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:21:56] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:30:05] (03PS1) 10RobH: adding new dns4003 [puppet] - 10https://gerrit.wikimedia.org/r/837737 (https://phabricator.wikimedia.org/T317247) [18:30:24] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4045.ulsfo.wmnet with OS buster [18:30:53] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns4003.mgmt.ulsfo.wmnet with reboot policy FORCED [18:30:57] (03CR) 10RobH: [C: 03+2] adding new dns4003 [puppet] - 10https://gerrit.wikimedia.org/r/837737 (https://phabricator.wikimedia.org/T317247) (owner: 10RobH) [18:34:58] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye [18:35:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye [18:35:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) Further updates on this thread: 1. The installation attempts and debugging above were on **bullseye**, but our cp puppetization is actually still... 
[18:41:58] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye [18:42:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors:... [18:45:57] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) I see our buster actually has `linux-image-5.10.0-0.deb10.17-amd64` available in its repos. It may just be a matter of figuring out how to launch... [18:48:58] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye [18:49:05] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye [18:51:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:56:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:57:59] 10SRE, 10ops-ulsfo, 10DC-Ops, 
10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [18:58:56] (03CR) 10CDanis: [C: 03+1] "lgtm as long as the numeric uid isn't changing" [puppet] - 10https://gerrit.wikimedia.org/r/837072 (owner: 10Muehlenhoff) [19:03:20] RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms [19:04:00] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.32 ms [19:09:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:09:13] ryankemper: ^ [19:09:33] gehel: looking [19:10:30] (03PS1) 10Zabe: vcl: stop overriding cache-control header for bad title errors [puppet] - 10https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932) [19:12:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) I've also found some other breadcrumbs. Runtime buster + 5.10 support is puppetized in `modules/profile/manifests/base/linux510.pp`. There's ins... [19:15:46] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye [19:15:50] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors: - dns4003 (**FAIL**)... 
[19:20:48] 10ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (10RobH) p:05Triage→03Medium [19:22:17] !log [Elastic] Banned `elastic1066` (`curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic1066-production-search-psi-eqiad"}}}'`); will restart elasticsearch-psi after shards drain [19:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:24:06] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:24:06] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:24:18] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [19:25:11] crap thats me [19:25:16] i forgot to hit enter on log [19:25:32] !log msw1-ulsfo swap, some mgmt flapping expected, swap complete but not powered back up yet [19:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:36] PROBLEM - Host cr3-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:25:48] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:25:48] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:25:54] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:25:54] PROBLEM - Host cp4033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:27:32] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:27:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:27:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:27:56] PROBLEM - Host ganeti4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:56] PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:34] ok, they should start coming back
[19:30:57] RECOVERY - Host cp4033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[19:32:05] !log msw1-ulsfo swap successful, mgmt recovering in icinga and tested connection with 3 servers all work
[19:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:32:59] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms
[19:32:59] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.51 ms
[19:34:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:35:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.494 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:35:49] RECOVERY - Host lvs4005.mgmt is UP: PING OK -
Packet loss = 0%, RTA = 71.32 ms
[19:35:49] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.23 ms
[19:36:05] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.84 ms
[19:36:11] RECOVERY - Host cr3-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.17 ms
[19:36:13] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.30 ms
[19:36:13] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms
[19:36:21] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.28 ms
[19:37:04] !log [Elastic] Restarted psi on `elastic1066`; will unban host after process is up and running
[19:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:49] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:41:25] !log [Elastic] Unbanned `elastic1066`
[19:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:58:37] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T2000).
[20:00:04] No Gerrit patches in the queue for this window AFAICS.
[20:00:23] Indeed, nothing in the queue
[20:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:21:27] SRE, MediaWiki-Uploading, MW-1.37-notes, MW-1.38-notes, and 4 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (Krinkle) Open→Resolved
[20:28:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:41:57] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:47:29] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (Volans) The bits for the reimage cookbooks are trivial to do, Spicerack has already support for custom images, see the `media_type` argument to https://do...
[21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221003T2100).
[21:07:37] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:47] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye
[21:18:51] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye
[21:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:27:25] (PS5) SBassett: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[21:27:58] (CR) CI reject: [V: -1] Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[21:33:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:42:09] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) dns4003 is getting stuck in the reimage at: ` 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'. 100.0% (1/1) succes...
[21:44:00] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS bullseye
[21:44:03] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors: - dns4003 (**FAIL**)...
[21:44:31] !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:45:44] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:51:20] (PS1) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751
[21:54:59] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott)
[22:18:56] (CR) Ladsgroup: [C: +1] admin: Revoke my ssh key temporarily [puppet] - https://gerrit.wikimedia.org/r/837079 (owner: Ladsgroup)
[22:21:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:26:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:51:57] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:31:01] (PS1) Stang: throttle: Add throttle rule for 2022-10-13 [mediawiki-config] - https://gerrit.wikimedia.org/r/837756 (https://phabricator.wikimedia.org/T319244)
[23:36:56] (PS1) Stang: ukwiki: Create flood group [mediawiki-config] - https://gerrit.wikimedia.org/r/837757 (https://phabricator.wikimedia.org/T319243)
[23:53:17] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:11] SRE, Traffic-Icebox, Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710 (Krinkle) Open→Declined I'm declining this as I no longer believe this is an important need for the original objective. I think Varnish is sufficiently standalone and s...