[00:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:04:59] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:10:16] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:11:18] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[00:14:47] (PS1) RobH: cp4045 spare role set [puppet] - https://gerrit.wikimedia.org/r/836955 (https://phabricator.wikimedia.org/T317244)
[00:15:25] (CR) RobH: [C: +2] cp4045 spare role set [puppet] - https://gerrit.wikimedia.org/r/836955 (https://phabricator.wikimedia.org/T317244) (owner: RobH)
[00:19:17] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[00:22:13] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[00:22:21] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[00:24:59] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:31:11] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) cp4045 failing to pxe boot. it could be a firmware issue, as the NIC came with 6.x firmware. I'll have to mess with rolling it back tomorrow (Friday) ` PXELI...
[00:31:21] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye
[00:31:27] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**F...
[01:11:26] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:18:02] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:08] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:21] SRE, serviceops, PHP 7.2 support, Performance Issue: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Reedy) Is there anything further to do on this? Or can it be closed due to the backports above, and the bump to PHP 7.4?
[02:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:01:48] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:05:16] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:32] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:38] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T314041)', diff saved to https://phabricator.wikimedia.org/P35195 and previous config saved to /var/cache/conftool/dbconfig/20220930-033356-ladsgroup.json
[03:34:01] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:49:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P35196 and previous config saved to /var/cache/conftool/dbconfig/20220930-034903-ladsgroup.json
[03:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P35197 and previous config saved to /var/cache/conftool/dbconfig/20220930-040409-ladsgroup.json
[04:15:34] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T314041)', diff saved to https://phabricator.wikimedia.org/P35198 and previous config saved to /var/cache/conftool/dbconfig/20220930-041916-ladsgroup.json
[04:19:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:19:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35199 and previous config saved to /var/cache/conftool/dbconfig/20220930-041937-ladsgroup.json
[04:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:49:53] (CR) Giuseppe Lavagetto: [C: +2] scap.cfg.erb: 7.2 -> 7.4 [puppet] - https://gerrit.wikimedia.org/r/836932 (https://phabricator.wikimedia.org/T271736) (owner: Ahmon Dancy)
[04:52:33] (CR) Giuseppe Lavagetto: [C: +2] deployment-prep: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/835234 (owner: Zabe)
[04:53:47] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: allow removing a php version from a running system [puppet] - https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:00:07] (PS16) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[05:02:33] (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:05:02] (PS1) Marostegui: db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836981
[05:05:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126', diff saved to https://phabricator.wikimedia.org/P35200 and previous config saved to /var/cache/conftool/dbconfig/20220930-050533-root.json
[05:05:39] (CR) Marostegui: [C: +2] db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836981 (owner: Marostegui)
[05:10:21] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) @Jclark-ctr did Dell come back to you with any update on what to do next?
[05:12:01] (PS1) Marostegui: db1166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836982
[05:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P35201 and previous config saved to /var/cache/conftool/dbconfig/20220930-051206-root.json
[05:12:42] (PS3) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894)
[05:12:44] (PS1) Giuseppe Lavagetto: profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983
[05:12:46] (CR) Marostegui: [C: +2] db1166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836982 (owner: Marostegui)
[05:13:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35202 and previous config saved to /var/cache/conftool/dbconfig/20220930-051309-root.json
[05:15:30] (PS2) Giuseppe Lavagetto: profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983
[05:15:33] (PS4) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894)
[05:16:36] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37396/console" [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:18:32] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983 (owner: Giuseppe Lavagetto)
[05:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35203 and previous config saved to /var/cache/conftool/dbconfig/20220930-051919-root.json
[05:19:33] (PS1) Marostegui: Revert "db1166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836725
[05:19:40] (PS1) Marostegui: Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836986
[05:20:17] (CR) Marostegui: [C: +2] Revert "db1166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836725 (owner: Marostegui)
[05:20:27] (CR) Marostegui: [C: +2] Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836986 (owner: Marostegui)
[05:20:43] (PS17) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[05:20:45] _joe_: can I merge your change?
[05:20:58] <_joe_> sigh, yes
[05:21:11] _joe_: done!
[05:23:21] (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:27:00] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:28:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35204 and previous config saved to /var/cache/conftool/dbconfig/20220930-052814-root.json
[05:29:05] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35206 and previous config saved to /var/cache/conftool/dbconfig/20220930-053424-root.json
[05:43:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35207 and previous config saved to /var/cache/conftool/dbconfig/20220930-054319-root.json
[05:49:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35208 and previous config saved to /var/cache/conftool/dbconfig/20220930-054929-root.json
[05:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35209 and previous config saved to /var/cache/conftool/dbconfig/20220930-055824-root.json
[06:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35210 and previous config saved to /var/cache/conftool/dbconfig/20220930-060434-root.json
[06:04:41] (CR) Marostegui: [C: +2] Add Cumin alias for mariadb objectstash [puppet] - https://gerrit.wikimedia.org/r/836805 (owner: Muehlenhoff)
[06:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35211 and previous config saved to /var/cache/conftool/dbconfig/20220930-061329-root.json
[06:17:27] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:19:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35212 and previous config saved to /var/cache/conftool/dbconfig/20220930-061939-root.json
[06:25:03] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35213 and previous config saved to /var/cache/conftool/dbconfig/20220930-062834-root.json
[06:34:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35214 and previous config saved to /var/cache/conftool/dbconfig/20220930-063444-root.json
[06:43:27] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for Apache on piwik/matomo [puppet] - https://gerrit.wikimedia.org/r/836859 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:43:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35215 and previous config saved to /var/cache/conftool/dbconfig/20220930-064339-root.json
[06:48:23] (PS4) Muehlenhoff: Extend maps Cumin alias with site-specific equivalents [puppet] - https://gerrit.wikimedia.org/r/836792
[06:49:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35216 and previous config saved to /var/cache/conftool/dbconfig/20220930-064949-root.json
[06:52:39] (CR) Muehlenhoff: [C: +2] Extend maps Cumin alias with site-specific equivalents [puppet] - https://gerrit.wikimedia.org/r/836792 (owner: Muehlenhoff)
[06:53:29] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for FPM/LibreNMS [puppet] - https://gerrit.wikimedia.org/r/836697 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35217 and previous config saved to /var/cache/conftool/dbconfig/20220930-065844-root.json
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220930T0700)
[07:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35218 and previous config saved to /var/cache/conftool/dbconfig/20220930-070454-root.json
[07:04:59] (PS1) Elukey: knative-serving: allow dnsConfig settings for autoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/837069 (https://phabricator.wikimedia.org/T318814)
[07:10:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32934
[07:13:40] (CR) Elukey: [C: +2] knative-serving: allow dnsConfig settings for autoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/837069 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[07:17:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934
[07:18:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:19:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:21:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 52320
[07:21:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52320
[07:23:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36692
[07:25:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:26:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:27:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:27:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36692
[07:27:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:37:40] !log add RPKI ROAs for 185.71.138.0/24 and 2001:67c:930::/48
[07:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] (PS1) Muehlenhoff: bgpalerter: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837070
[07:39:40] (PS1) Muehlenhoff: k8s::apiserver: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837071
[07:39:42] (PS1) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[07:41:00] (CR) CI reject: [V: -1] netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[07:45:11] (PS4) Elukey: coredns: add rewrite actions to the config map [deployment-charts] - https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814)
[07:45:13] (PS1) Elukey: admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814)
[07:46:32] (PS2) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[07:51:07] SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Joe)
[07:57:37] (CR) Ayounsi: [C: +1] "This LGTM, but iirc that was a Chris original™️." [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[08:06:01] SRE, ops-eqiad: Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (ops-monitoring-bot)
[08:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:31:29] (PS1) Hashar: Add .gitreview [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074
[08:32:45] (CR) Hashar: "git-review is a python tool to assist interactions with Gerrit https://docs.opendev.org/opendev/git-review/" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[08:34:32] SRE, ops-eqiad: Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (dcaro) @Andrew fyi
[08:36:07] (CR) Arturo Borrero Gonzalez: [C: +2] O:toolforge: block local crontabs on accessible hosts [puppet] - https://gerrit.wikimedia.org/r/836258 (owner: Majavah)
[08:45:21] SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) @Jclark-ctr Awesome thanks! We need to schedule a window to do the plugging/unplugging/reconfiguring. Would next Tu...
[09:03:36] (PS5) Elukey: coredns: add rewrite actions to the config map [deployment-charts] - https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814)
[09:03:38] (PS2) Elukey: admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814)
[09:10:01] SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (Peachey88)
[09:13:01] (CR) Klausman: [C: +1] admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[09:23:36] (CR) Hashar: "That is quite nice and a very nice addition. I have found a few issues here and there and proposed amendment to extend the documentation. " [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[09:27:39] (CR) Btullis: [C: +1] "The change looks good to me, but I'd look at getting a +1 from someone in the ServiceOps team as well." [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[09:31:04] (PS15) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:31:06] (PS1) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[09:32:43] (CR) David Caro: "Just rebased this on top of latest, there were some changes to the file." [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:33:05] (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:34:01] (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[09:38:10] (PS1) Arturo Borrero Gonzalez: openstack: neutron: introduce workaround for debian bug #989162 [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824)
[09:39:05] (PS2) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:39:20] (PS1) Ladsgroup: admin: Revoke my ssh key temporarily [puppet] - https://gerrit.wikimedia.org/r/837079
[09:39:41] (CR) Ladsgroup: [C: -2] "not yet" [puppet] - https://gerrit.wikimedia.org/r/837079 (owner: Ladsgroup)
[09:39:59] (PS2) Arturo Borrero Gonzalez: openstack: neutron: introduce workaround for debian bug #989162 [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824)
[09:40:09] (PS3) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:42:11] !log installing Linux 5.10.140 updates on Bullseye hosts (released via 11.5 point release), just rollout of the package, no reboots involved
[09:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:48] (PS4) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:45:43] (PS5) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:46:55] (CR) Clément Goubert: "Thanks!" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[09:48:48] (CR) Clément Goubert: [C: +1] "LGTM" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[09:49:41] (CR) Btullis: [C: +2] Remove duplicate YAML hash from releases hieradata [puppet] - https://gerrit.wikimedia.org/r/830569 (owner: Btullis)
[09:53:49] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but I didn't check if the CREATE TABLE instruction would succeed or not." [puppet] - https://gerrit.wikimedia.org/r/836849 (https://phabricator.wikimedia.org/T318047) (owner: David Caro)
[09:54:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35219 and previous config saved to /var/cache/conftool/dbconfig/20220930-095423-ladsgroup.json
[09:54:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:56:48] (CR) Arturo Borrero Gonzalez: "Did you consider hosting the deployment information + configuration values in the same repo as the source code? And then have a ./deploy.s" [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925) (owner: Majavah)
[09:57:49] (CR) Arturo Borrero Gonzalez: [C: +1] Rename labs and cloud filters [homer/public] - https://gerrit.wikimedia.org/r/767476 (owner: Ayounsi)
[09:58:46] (CR) Arturo Borrero Gonzalez: [C: +1] "perhaps this is no longer necessary?" [puppet] - https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: Jbond)
[10:07:51] (CR) Hashar: "You can go ahead and CR+2 / V+2 and submit the change, I don't have permissions on this repo ;)" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[10:09:11] (CR) Hnowlan: "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: Jbond)
[10:09:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P35220 and previous config saved to /var/cache/conftool/dbconfig/20220930-100930-ladsgroup.json
[10:11:53] (Abandoned) Majavah: toolforge: provision delete-crashing-pods values [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925) (owner: Majavah)
[10:16:26] (CR) David Caro: [C: +2] maintain-dbusers: add missing collate to the account table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/836849 (https://phabricator.wikimedia.org/T318047) (owner: David Caro)
[10:17:57] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:20:13] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:24:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P35221 and previous config saved to /var/cache/conftool/dbconfig/20220930-102436-ladsgroup.json
[10:27:29] (CR) Cathal Mooney: [C: +1] "LGTM, I will discuss it within infra foundations however, in case this is something we wish to do across all systems or not." [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:28:18] (CR) Hashar: doc: add README.md (4 comments) [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[10:28:23] (PS6) Hashar: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[10:28:43] (PS1) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[10:30:03] (PS2) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[10:35:46] (CR) Jcrespo: "Context: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/mariadb/dbstore_multiinsta" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[10:39:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35222 and previous config saved to /var/cache/conftool/dbconfig/20220930-103943-ladsgroup.json
[10:39:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[10:39:47] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:39:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[10:40:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35223 and previous config saved to /var/cache/conftool/dbconfig/20220930-104004-ladsgroup.json
[10:43:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: introduce workaround for debian bug #989162 [puppet] - 10https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [10:43:51] (03CR) 10Muehlenhoff: "Why not simply rebuild the bridge-utils deb?" [puppet] - 10https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [10:44:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37397/" [puppet] - 10https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [10:45:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: introduce workaround for debian bug #989162 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [10:46:55] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10MoritzMuehlenhoff) Traffic folks, can we please go ahead and fully decom cp5001, then? Right now this is in a weird limbo state between debmonitor/puppetdb/Netbox. [10:47:06] (03CR) 10Hnowlan: [C: 03+1] "lgtm, nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/836790 (owner: 10Muehlenhoff) [10:53:00] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) Sorry yes.
Dell is shipping out another memory stick waiting on part right now [11:00:24] (03CR) 10Muehlenhoff: openstack: neutron: introduce workaround for debian bug #989162 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [11:00:31] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Add .gitreview [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/837074 (owner: 10Hashar) [11:01:44] (03CR) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [11:01:56] (03CR) 10FNegri: [C: 03+2] ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [11:06:31] (03Merged) 10jenkins-bot: ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [11:08:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:09:34] (03PS7) 10Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 [11:11:46] (03CR) 10Clément Goubert: doc: add README.md (031 comment) [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 (owner: 10Clément Goubert) [11:13:28] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [11:13:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST jobs) on k8s-staging@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:03] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Great thank you. The host is off, so please feel free to replace it whenever you like. [11:15:57] PROBLEM - Disk space on ganeti6002 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti6002&var-datasource=drmrs+prometheus/ops [11:16:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host puppetdb-test2001.codfw.wmnet [11:16:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:21:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:21:39] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache puppetdb-test2001.codfw.wmnet on all recursors [11:21:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetdb-test2001.codfw.wmnet on all recursors [11:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P35224 and previous config saved to /var/cache/conftool/dbconfig/20220930-112307-root.json [11:25:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Lucas_Werkmeister_WMDE) [11:25:53] Hi, I am getting some failures from parsoid on deployment-prep that affects restbase tests. Here is the ticket: https://phabricator.wikimedia.org/T319009 What would be the right channel to reach out to? 
[11:25:58] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [11:27:03] nemo-yiannis: I would try #wikimedia-releng (wikibugs is already posting updates to the task there due to the relevant tags) [11:27:15] Thanks Lucas_WMDE [11:29:45] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more sysctl fine-tuning [puppet] - 10https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824) [11:31:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35225 and previous config saved to /var/cache/conftool/dbconfig/20220930-113101-root.json [11:37:13] RECOVERY - Disk space on ganeti6002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti6002&var-datasource=drmrs+prometheus/ops [11:41:46] (03PS2) 10ArielGlenn: snapshot: Add linktarget [puppet] - 10https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: 10Ladsgroup) [11:42:33] (03CR) 10ArielGlenn: [C: 03+2] snapshot: Add linktarget [puppet] - 10https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: 10Ladsgroup) [11:43:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
I think we may not need to disable the rp_filter on the physical, but it won't make a difference, as with the default route facing " [puppet] - 10https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [11:44:09] (03PS2) 10Muehlenhoff: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 [11:45:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: l3_agent: more sysctl fine-tuning [puppet] - 10https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [11:45:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [11:46:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35226 and previous config saved to /var/cache/conftool/dbconfig/20220930-114605-root.json [11:46:25] (03PS1) 10ArielGlenn: tiny whitespace fix in sql/xml dumps tables list [puppet] - 10https://gerrit.wikimedia.org/r/837089 [11:51:41] (03CR) 10Hokwelum: [C: 03+1] tiny whitespace fix in sql/xml dumps tables list [puppet] - 10https://gerrit.wikimedia.org/r/837089 (owner: 10ArielGlenn) [11:52:12] (03CR) 10ArielGlenn: [C: 03+2] tiny whitespace fix in sql/xml dumps tables list [puppet] - 10https://gerrit.wikimedia.org/r/837089 (owner: 10ArielGlenn) [11:54:21] (03CR) 10Hnowlan: thumbor: new service chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:55:56] (03PS4) 10ArielGlenn: remove php7.2 from the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) (owner: 10Hokwelum) [11:57:28] (03CR) 10ArielGlenn: [C: 03+2] remove php7.2 from the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) 
(owner: 10Hokwelum) [11:59:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetdb-test2001.codfw.wmnet [12:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35227 and previous config saved to /var/cache/conftool/dbconfig/20220930-120113-root.json [12:07:19] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [12:08:12] (03PS1) 10Muehlenhoff: mirrors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) [12:08:14] (03PS1) 10Muehlenhoff: ldap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) [12:08:16] (03PS1) 10Muehlenhoff: docker_registry/imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) [12:08:18] (03PS1) 10Muehlenhoff: tlsproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) [12:08:20] (03PS1) 10Muehlenhoff: alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013) [12:08:22] (03PS1) 10Muehlenhoff: dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) [12:09:35] (03PS1) 10Muehlenhoff: Add DHCP entry for puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/837099 (https://phabricator.wikimedia.org/T318931) [12:09:48] (03PS2) 10Muehlenhoff: mirrors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) [12:10:09] (03PS2) 10Muehlenhoff: ldap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) [12:10:23] (03PS2) 10Muehlenhoff: 
docker_registry/imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) [12:11:36] (03CR) 10CI reject: [V: 04-1] ldap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:12:41] (03CR) 10CI reject: [V: 04-1] docker_registry/imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:16:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35228 and previous config saved to /var/cache/conftool/dbconfig/20220930-121618-root.json [12:16:25] (03PS3) 10Muehlenhoff: ldap profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) [12:16:42] (03PS3) 10Muehlenhoff: docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) [12:17:03] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP entry for puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/837099 (https://phabricator.wikimedia.org/T318931) (owner: 10Muehlenhoff) [12:23:46] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for FPM on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/837101 (https://phabricator.wikimedia.org/T135991) [12:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:26:11] (03PS2) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [12:29:06] (03CR) 10Muehlenhoff: [C: 03+2] ldap 
profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:29:14] (03PS4) 10Muehlenhoff: ldap profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) [12:31:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35229 and previous config saved to /var/cache/conftool/dbconfig/20220930-123123-root.json [12:32:03] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:35:18] (03PS4) 10Muehlenhoff: docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) [12:37:57] (03PS4) 10BBlack: cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) [12:39:01] (03CR) 10Muehlenhoff: [C: 03+2] docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:40:33] PROBLEM - Check systemd state on elastic1096 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:17] (03PS2) 10Muehlenhoff: snapshot: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/837101 [12:46:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35230 and previous config saved to /var/cache/conftool/dbconfig/20220930-124628-root.json [12:47:56] (03PS1) 10Muehlenhoff: Use correct auto restart 
define [puppet] - 10https://gerrit.wikimedia.org/r/837104 [12:48:11] (03PS1) 10Esanders: Enable DiscussionTools mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837105 (https://phabricator.wikimedia.org/T317467) [12:49:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Use correct auto restart define [puppet] - 10https://gerrit.wikimedia.org/r/837104 (owner: 10Muehlenhoff) [12:50:15] (03PS5) 10BBlack: cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) [12:50:16] (03PS1) 10BBlack: Remove cp4021 + cp4027 p11n [puppet] - 10https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) [12:51:20] (03CR) 10Muehlenhoff: [C: 03+2] Use correct auto restart define [puppet] - 10https://gerrit.wikimedia.org/r/837104 (owner: 10Muehlenhoff) [12:51:29] (03CR) 10CI reject: [V: 04-1] Remove cp4021 + cp4027 p11n [puppet] - 10https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) (owner: 10BBlack) [12:53:01] (03PS2) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [12:53:03] (03CR) 10BBlack: cache node disk layout p11n for F4 config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: 10BBlack) [12:54:20] (03CR) 10BBlack: [C: 03+2] cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: 10BBlack) [12:55:44] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 (owner: 10David Caro) [12:56:23] (03PS2) 10BBlack: Remove cp4021 + cp4027 p11n [puppet] - 10https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) [12:57:30] (03CR) 10BBlack: [C: 03+2] Remove cp4021 + cp4027 p11n [puppet] - 
10https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) (owner: 10BBlack) [13:01:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35231 and previous config saved to /var/cache/conftool/dbconfig/20220930-130133-root.json [13:02:59] 10SRE, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (10BBlack) [13:03:19] 10SRE, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (10BBlack) a:05BBlack→03RobH >>! In T318963#8274300, @RobH wrote: > Brandon, > > Both of these hosts have had the decom script run, but they still have references in... [13:05:36] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) @ayounsi is there a time window you prefer? I can be available 1pm UTC time I am available any day. [13:06:01] RECOVERY - Check systemd state on elastic1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:35] 10SRE, 10Observability-Logging, 10Observability-Metrics, 10serviceops, and 2 others: Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10CDanis) Just pinging this task as OKR season is upon us and this might be a useful and fun thing to sneak... 
[13:12:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s::apiserver: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837071 (owner: 10Muehlenhoff) [13:13:19] (03PS1) 10Muehlenhoff: Add puppetdb-test2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931) [13:13:35] (03PS16) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:13:37] (03PS3) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [13:14:23] (03PS2) 10Muehlenhoff: Add puppetdb-test2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931) [13:14:35] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:14:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:14:50] (03CR) 10Elukey: [C: 03+2] coredns: add rewrite actions to the config map [deployment-charts] - 10https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [13:15:43] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:16:00] (03PS17) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 
(https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:16:02] (03PS4) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [13:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35232 and previous config saved to /var/cache/conftool/dbconfig/20220930-131638-root.json [13:17:45] (03CR) 10Muehlenhoff: [C: 03+2] Add puppetdb-test2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931) (owner: 10Muehlenhoff) [13:18:35] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:19:05] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:05] (03PS1) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) [13:19:29] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 (owner: 10David Caro) [13:19:42] (03CR) 10Elukey: [C: 03+2] admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814) (owner: 10Elukey) [13:19:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:20:45] 
PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:22:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:22:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:22:26] (03PS18) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:22:28] (03PS5) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [13:22:58] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:23:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:23:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:23:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:23:29] (03PS1) 10Kosta Harlan: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837114 [13:24:01] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [13:25:35] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 (owner: 10David Caro) [13:26:14] (03PS2) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) [13:27:43] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:31:23] (03PS6) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [13:33:17] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:34:12] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 (owner: 10David Caro) [13:38:08] (03CR) 10FNegri: "I have verified this is now working correctly by re-running the cookbook on a host that was already set up:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [13:44:20] (03PS1) 10Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) [13:44:43] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:17] 
(03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [13:47:11] (03PS1) 10JMeybohm: Disable zipkin and tracing for wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/837117 (https://phabricator.wikimedia.org/T318814) [13:47:13] (03PS1) 10JMeybohm: Enable additional envoy native metrics in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/837118 [13:47:41] (03CR) 10David Caro: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [13:51:53] !log installing puppetdb-test2001 T318931 [13:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] T318931: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931 [13:52:05] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:52:19] (03PS2) 10Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) [13:53:55] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37399/console" [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [13:57:12] (03PS1) 10Muehlenhoff: mariadb::stock_heartbeat: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837120 [13:57:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:59:08] (03PS3) 10Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - 
10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) [13:59:37] (03CR) 10Hashar: [C: 03+1] doc: add README.md (031 comment) [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 (owner: 10Clément Goubert) [13:59:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:00:19] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37400/console" [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [14:01:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37401/console" [puppet] - 10https://gerrit.wikimedia.org/r/837071 (owner: 10Muehlenhoff) [14:03:02] (03CR) 10JMeybohm: [C: 03+1] "I would be chicken and stop puppet on multiple masters before merging this, but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/837071 (owner: 10Muehlenhoff) [14:03:14] (03PS1) 10Muehlenhoff: openstack::monitor::networktests: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837121 [14:03:16] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] k8s::apiserver: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/837071 (owner: 10Muehlenhoff) [14:08:25] (03PS19) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:08:27] (03PS7) 10David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - 10https://gerrit.wikimedia.org/r/837077 [14:08:29] (03PS1) 10David Caro: flake8: Several pep8/flake8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/837126 [14:09:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:12:52] (03CR) 10Elukey: Disable zipkin and tracing for wikikube clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837117 (https://phabricator.wikimedia.org/T318814) (owner: 10JMeybohm) [14:14:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:15:30] (03CR) 10Elukey: [C: 03+1] Enable additional envoy native metrics in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/837118 (owner: 10JMeybohm) [14:23:47] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10nskaggs) I think dupe of T319025 [14:26:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:28:49] 10SRE, 10vm-requests: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931 (10MoritzMuehlenhoff) 05Open→03Resolved puppetdb-test2001 has been created and installed. [14:29:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:30:25] (03CR) 10Urbanecm: [C: 03+1] "LGTM. 
Likely can be merged even before wmf.4 lands, as we're 100% on true anyway. Thanks for making Growth in IS.php shorter! 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837114 (owner: 10Kosta Harlan) [14:38:29] (03PS1) 10Andrew Bogott: alerts.downtime_host: add a wildcard to the end of the hostname [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [14:40:00] (03CR) 10David Caro: [C: 03+1] alerts.downtime_host: add a wildcard to the end of the hostname (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:41:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [14:43:12] (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: add a wildcard to the end of the hostname [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:43:49] (03CR) 10Nskaggs: alerts.downtime_host: add a wildcard to the end of the hostname (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:45:38] (03CR) 10Andrew Bogott: alerts.downtime_host: add a wildcard to the end of the hostname (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:49:34] (03PS2) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [14:51:08] (03PS3) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [14:55:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:21] (03CR) 10Volans: "FYI inline" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:58:18] (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: attempt to match alert 
hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [15:01:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [15:02:22] (03PS3) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) [15:02:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:04:12] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:50] (03CR) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [15:16:31] (03CR) 10FNegri: [C: 03+2] ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [15:16:35] (03PS5) 10JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) [15:17:38] (03PS5) 10JMeybohm: Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) [15:19:19] (03PS5) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [15:20:03] (03CR) 10David Caro: alerts.downtime_host: attempt to match alert hostnames with : (031 comment) [cookbooks] 
(wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [15:20:11] (03CR) 10CI reject: [V: 04-1] maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:20:32] (03Merged) 10jenkins-bot: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [15:21:57] (03PS6) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [15:27:46] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:32:40] (03PS4) 10Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) [15:34:00] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37402/console" [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [15:34:23] 10SRE, 10Traffic, 10decommission-hardware: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (10RobH) 05Open→03Resolved [15:34:25] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) [15:37:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye [15:44:44] 10SRE, 10Infrastructure-Foundations, 10netops: Q4: esams atlas anchor - https://phabricator.wikimedia.org/T307021 (10RobH) [15:45:23] 10SRE, 10Infrastructure-Foundations, 10netops: Q4: esams atlas anchor - https://phabricator.wikimedia.org/T307021 (10RobH) [15:45:58] (KubernetesAPILatency) 
firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:50:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:03] (03CR) 10David Caro: "For irl chat:" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:05:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:37] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Andrew) [16:10:26] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) [16:15:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35233 and previous config saved to /var/cache/conftool/dbconfig/20220930-162027-ladsgroup.json [16:20:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:20:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:26:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:31:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:32:05] (03CR) 10BryanDavis: "I would like to know more about how the notification system failed before abandoning the idea of the purge script. 
See T247517#8211187 for" [puppet] - 10https://gerrit.wikimedia.org/r/829231 (https://phabricator.wikimedia.org/T247517) (owner: 10Andrew Bogott) [16:34:07] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [16:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P35234 and previous config saved to /var/cache/conftool/dbconfig/20220930-163533-ladsgroup.json [16:50:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P35235 and previous config saved to /var/cache/conftool/dbconfig/20220930-165040-ladsgroup.json [16:54:22] !log bblack@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye [16:54:30] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye [16:54:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10AnnWF) [17:05:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35236 and previous config saved to /var/cache/conftool/dbconfig/20220930-170546-ladsgroup.json [17:05:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:05:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:06:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:06:20] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35237 and previous config saved to /var/cache/conftool/dbconfig/20220930-170620-ladsgroup.json [17:16:44] (03CR) 10Samtar: "I'll be the first to admit that I'm not only unsure if this is needed, but I don't fully understand what it does — I'll mark this for revi" [puppet] - 10https://gerrit.wikimedia.org/r/837107 (https://phabricator.wikimedia.org/T317417) (owner: 10Samtar) [17:17:16] ^ "It's only puppet, what could go wrong?" :D [17:18:19] Friday evening, the best time for random puppet changes [17:20:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10greg) [17:24:51] !log bblack@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cp4045.ulsfo.wmnet with OS bullseye [17:24:55] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**) - Removed f... [17:25:00] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) cp4045 firmware inventory: bios is newest 1.6.5 10G nic is 22.00.07.60 , downgrading to 21.85.21.92 idrac is 5.10.30.00, cap at this and won't upgrade to 6.x which breaks http... 
[17:26:16] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) [17:28:36] (03PS4) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [17:29:01] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319057 (10Damilare) [17:29:14] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Damilare) [17:29:19] (03CR) 10Wctaiwan: [C: 03+1] "Translations look good." [puppet] - 10https://gerrit.wikimedia.org/r/816161 (owner: 10Diskdance) [17:32:12] (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [17:33:23] (03PS5) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [17:37:33] (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [17:43:45] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye [17:43:50] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye [18:01:20] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye [18:01:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**) - Removed fro... [18:08:54] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye [18:08:58] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye [18:19:46] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: add hbs330 support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) [18:22:43] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) [18:22:50] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) {F35541613} The last time I had an issue with driver support in the installer, I recall @MoritzMuehlenhoff being the person to help me out. Moritz is this still the case, and are... [18:23:26] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) [18:23:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) a:05RobH→03MoritzMuehlenhoff [18:30:15] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye [18:30:19] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**) - Removed fro... 
[18:35:01] (03PS6) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [18:35:07] (03CR) 10CI reject: [V: 04-1] alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [18:48:12] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (10Jdforrester-WMF) [19:03:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10AnnWF) [19:21:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Aklapper) [Please don't copy some existing task. Please use the proper template and make sure the template is linked from a potential team onboarding doc. Thank you!] [19:31:36] (03PS1) 10Ebernhardson: Update elasticsearch memory pressure alerts [alerts] - 10https://gerrit.wikimedia.org/r/837180 [19:33:36] (03CR) 10Marostegui: [C: 03+1] mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - 10https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [19:34:37] (03CR) 10Jcrespo: "This didn't work,import failed again :-(" [puppet] - 10https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: 10Jcrespo) [19:37:54] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) I did a little digging from the `install_console` shell on this host. lspci output for this adapter is: ` ~ # lspci -v -s 65:00.0 -nn 65:00.0 Ser... 
[19:52:23] (03PS1) 10Jdlrobson: Fix page toolbar border [skins/Vector] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836993 (https://phabricator.wikimedia.org/T318952) [20:24:02] (03CR) 10Volans: [C: 03+1] "LGTM, although I'm not familiar with the scripts it's trivial enough." [puppet] - 10https://gerrit.wikimedia.org/r/837126 (owner: 10David Caro) [20:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:30:46] (03CR) 10Volans: "reply inline" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [20:32:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:38:10] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [20:54:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet [20:55:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Damilare) [20:56:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare 
Adedoyin - https://phabricator.wikimedia.org/T319057 (10Damilare) [20:57:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Damilare) [20:59:10] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10RobH) Update: cp4037 is racked, but I had to steal its optic for T280202, since its cp4021 was busted anyhow. cp4045 is racked and accessible, but we've run into an installer issue on its insta... [21:02:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2001.codfw.wmnet [21:11:16] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10Peachey88) [21:17:22] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:43:30] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:07:13] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) Was just notified by data center of delivery from dell. 
[22:18:32] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35240 and previous config saved to /var/cache/conftool/dbconfig/20220930-224027-ladsgroup.json [22:40:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:43:40] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P35241 and previous config saved to /var/cache/conftool/dbconfig/20220930-225534-ladsgroup.json [23:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P35242 and previous config saved to /var/cache/conftool/dbconfig/20220930-231040-ladsgroup.json [23:25:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35243 and previous config saved to /var/cache/conftool/dbconfig/20220930-232546-ladsgroup.json [23:25:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:25:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:26:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:37:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:42:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:44:52] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:00] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook