[00:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:04:59] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:07:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:10:16] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:11:18] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[00:14:47] <wikibugs>	 (PS1) RobH: cp4045 spare role set [puppet] - https://gerrit.wikimedia.org/r/836955 (https://phabricator.wikimedia.org/T317244)
[00:15:25] <wikibugs>	 (CR) RobH: [C: +2] cp4045 spare role set [puppet] - https://gerrit.wikimedia.org/r/836955 (https://phabricator.wikimedia.org/T317244) (owner: RobH)
[00:19:17] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[00:22:13] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[00:22:21] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[00:24:59] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:31:11] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) cp4045 failing to pxe boot.  it could be firmware issue, as the NIC came with 6.x firmware.  I'll have to mess with rolling it back tomorrow (Friday) ` PXELI...
[00:31:21] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye
[00:31:27] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**F...
[01:11:26] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:18:02] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:08] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:21] <wikibugs>	 SRE, serviceops, PHP 7.2 support, Performance Issue: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Reedy) Is there anything further to do on this? Or can it be closed due to the backports above, and the bump to PHP 7.4?
[02:39:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:01:48] <icinga-wm>	 RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:05:16] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:32] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:38] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T314041)', diff saved to https://phabricator.wikimedia.org/P35195 and previous config saved to /var/cache/conftool/dbconfig/20220930-033356-ladsgroup.json
[03:34:01] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:49:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P35196 and previous config saved to /var/cache/conftool/dbconfig/20220930-034903-ladsgroup.json
[03:50:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:04:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P35197 and previous config saved to /var/cache/conftool/dbconfig/20220930-040409-ladsgroup.json
[04:15:34] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T314041)', diff saved to https://phabricator.wikimedia.org/P35198 and previous config saved to /var/cache/conftool/dbconfig/20220930-041916-ladsgroup.json
[04:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:19:20] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:19:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35199 and previous config saved to /var/cache/conftool/dbconfig/20220930-041937-ladsgroup.json
[04:25:13] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:49:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap.cfg.erb: 7.2 -> 7.4 [puppet] - https://gerrit.wikimedia.org/r/836932 (https://phabricator.wikimedia.org/T271736) (owner: Ahmon Dancy)
[04:52:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] deployment-prep: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/835234 (owner: Zabe)
[04:53:47] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: allow removing a php version from a running system [puppet] - https://gerrit.wikimedia.org/r/836783 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:00:07] <wikibugs>	 (PS16) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[05:02:33] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:05:02] <wikibugs>	 (PS1) Marostegui: db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836981
[05:05:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126', diff saved to https://phabricator.wikimedia.org/P35200 and previous config saved to /var/cache/conftool/dbconfig/20220930-050533-root.json
[05:05:39] <wikibugs>	 (CR) Marostegui: [C: +2] db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836981 (owner: Marostegui)
[05:10:21] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) @Jclark-ctr did Dell come back to you with any update on how to do next?
[05:12:01] <wikibugs>	 (PS1) Marostegui: db1166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836982
[05:12:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P35201 and previous config saved to /var/cache/conftool/dbconfig/20220930-051206-root.json
[05:12:42] <wikibugs>	 (PS3) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894)
[05:12:44] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983
[05:12:46] <wikibugs>	 (CR) Marostegui: [C: +2] db1166: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/836982 (owner: Marostegui)
[05:13:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35202 and previous config saved to /var/cache/conftool/dbconfig/20220930-051309-root.json
[05:15:30] <wikibugs>	 (PS2) Giuseppe Lavagetto: profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983
[05:15:33] <wikibugs>	 (PS4) Giuseppe Lavagetto: mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894)
[05:16:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37396/console" [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:18:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php::absented_version: also remove systemd unit [puppet] - https://gerrit.wikimedia.org/r/836983 (owner: Giuseppe Lavagetto)
[05:19:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35203 and previous config saved to /var/cache/conftool/dbconfig/20220930-051919-root.json
[05:19:33] <wikibugs>	 (PS1) Marostegui: Revert "db1166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836725
[05:19:40] <wikibugs>	 (PS1) Marostegui: Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836986
[05:20:17] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836725 (owner: Marostegui)
[05:20:27] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/836986 (owner: Marostegui)
[05:20:43] <wikibugs>	 (PS17) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[05:20:45] <marostegui>	 _joe_: can I merge your change?
[05:20:58] <_joe_>	 sigh, yes
[05:21:11] <marostegui>	 _joe_: done!
[05:23:21] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:27:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] mwdebug: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/836784 (https://phabricator.wikimedia.org/T318894) (owner: Giuseppe Lavagetto)
[05:28:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35204 and previous config saved to /var/cache/conftool/dbconfig/20220930-052814-root.json
[05:29:05] <wikibugs>	 (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[05:34:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35206 and previous config saved to /var/cache/conftool/dbconfig/20220930-053424-root.json
[05:43:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35207 and previous config saved to /var/cache/conftool/dbconfig/20220930-054319-root.json
[05:49:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35208 and previous config saved to /var/cache/conftool/dbconfig/20220930-054929-root.json
[05:58:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35209 and previous config saved to /var/cache/conftool/dbconfig/20220930-055824-root.json
[06:04:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35210 and previous config saved to /var/cache/conftool/dbconfig/20220930-060434-root.json
[06:04:41] <wikibugs>	 (CR) Marostegui: [C: +2] Add Cumin alias for mariadb objectstash [puppet] - https://gerrit.wikimedia.org/r/836805 (owner: Muehlenhoff)
[06:13:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35211 and previous config saved to /var/cache/conftool/dbconfig/20220930-061329-root.json
[06:17:27] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:19:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35212 and previous config saved to /var/cache/conftool/dbconfig/20220930-061939-root.json
[06:25:03] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35213 and previous config saved to /var/cache/conftool/dbconfig/20220930-062834-root.json
[06:34:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35214 and previous config saved to /var/cache/conftool/dbconfig/20220930-063444-root.json
[06:43:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for Apache on piwik/matomo [puppet] - https://gerrit.wikimedia.org/r/836859 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:43:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35215 and previous config saved to /var/cache/conftool/dbconfig/20220930-064339-root.json
[06:48:23] <wikibugs>	 (PS4) Muehlenhoff: Extend maps Cumin alias with site-specific equivalents [puppet] - https://gerrit.wikimedia.org/r/836792
[06:49:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35216 and previous config saved to /var/cache/conftool/dbconfig/20220930-064949-root.json
[06:52:39] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend maps Cumin alias with site-specific equivalents [puppet] - https://gerrit.wikimedia.org/r/836792 (owner: Muehlenhoff)
[06:53:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for FPM/LibreNMS [puppet] - https://gerrit.wikimedia.org/r/836697 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:58:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35217 and previous config saved to /var/cache/conftool/dbconfig/20220930-065844-root.json
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220930T0700)
[07:04:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35218 and previous config saved to /var/cache/conftool/dbconfig/20220930-070454-root.json
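The db1126 and db1166 repools above step through 1%, 3%, 5%, 10%, 25%, 50%, 75% and 100%, with dbctl commits roughly 15 minutes apart. A minimal Python sketch of that ramp, assuming a fixed 15-minute interval inferred from the log timestamps; `repool_schedule` is a hypothetical helper that only computes the timetable and does not call dbctl:

```python
from datetime import datetime, timedelta

# Percentages observed in the "(re)pooling @ N%: After upgrade" commits above.
RAMP_PERCENTAGES = [1, 3, 5, 10, 25, 50, 75, 100]


def repool_schedule(start, percentages=RAMP_PERCENTAGES,
                    interval=timedelta(minutes=15)):
    """Return (time, percent) pairs for a stepped repool.

    Hypothetical helper: the fixed 15-minute interval is an assumption
    read off the log timestamps, not a documented dbctl setting.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(percentages)]


if __name__ == "__main__":
    for when, pct in repool_schedule(datetime(2022, 9, 30, 5, 13)):
        print(f"{when:%H:%M} db1126 @ {pct}%")
```

Starting at 05:13, this reproduces the minute of each db1126 commit (05:13 through 06:58); the few seconds of drift per step in the real log are not modelled.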
[07:04:59] <wikibugs>	 (PS1) Elukey: knative-serving: allow dnsConfig settings for autoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/837069 (https://phabricator.wikimedia.org/T318814)
[07:10:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32934
[07:13:40] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: allow dnsConfig settings for autoscaler [deployment-charts] - https://gerrit.wikimedia.org/r/837069 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[07:17:29] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934
[07:18:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:19:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:21:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 52320
[07:21:51] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52320
[07:23:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36692
[07:25:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:26:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:27:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:27:48] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36692
[07:27:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:37:40] <XioNoX>	 !log add RPKI ROAs for 185.71.138.0/24 and 2001:67c:930::/48
[07:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] <wikibugs>	 (PS1) Muehlenhoff: bgpalerter: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837070
[07:39:40] <wikibugs>	 (PS1) Muehlenhoff: k8s::apiserver: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837071
[07:39:42] <wikibugs>	 (PS1) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[07:41:00] <wikibugs>	 (CR) CI reject: [V: -1] netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[07:45:11] <wikibugs>	 (PS4) Elukey: coredns: add rewrite actions to the config map [deployment-charts] - https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814)
[07:45:13] <wikibugs>	 (PS1) Elukey: admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814)
[07:46:32] <wikibugs>	 (PS2) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[07:51:07] <wikibugs>	 SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Joe)
[07:57:37] <wikibugs>	 (CR) Ayounsi: [C: +1] "This LGTM, but iirc that was a Chris original™️." [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[08:06:01] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (ops-monitoring-bot)
[08:25:13] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:31:29] <wikibugs>	 (PS1) Hashar: Add .gitreview [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074
[08:32:45] <wikibugs>	 (CR) Hashar: "git-review is a python tool to assist interactions with Gerrit https://docs.opendev.org/opendev/git-review/" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[08:34:32] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (dcaro) @Andrew fyi
[08:36:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] O:toolforge: block local crontabs on accessible hosts [puppet] - https://gerrit.wikimedia.org/r/836258 (owner: Majavah)
[08:45:21] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) @Jclark-ctr Awesome thanks! We need to schedule a window to do the plugging/unplugging/reconfiguring. Would next Tu...
[09:03:36] <wikibugs>	 (PS5) Elukey: coredns: add rewrite actions to the config map [deployment-charts] - https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814)
[09:03:38] <wikibugs>	 (PS2) Elukey: admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814)
[09:10:01] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (Peachey88)
[09:13:01] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[09:23:36] <wikibugs>	 (CR) Hashar: "That is quite nice and a very nice addition. I have found a few issues here and there and proposed amendment to extend the documentation. " [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[09:27:39] <wikibugs>	 (CR) Btullis: [C: +1] "The change looks good to me, but I'd look at getting a +1 from someone in the ServiceOps team as well." [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[09:31:04] <wikibugs>	 (PS15) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:31:06] <wikibugs>	 (PS1) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[09:32:43] <wikibugs>	 (CR) David Caro: "Just rebased this on top of latest, there were some changes to the file." [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:33:05] <wikibugs>	 (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:34:01] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[09:38:10] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: neutron: introduce workaround for debian bug #989162 [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824)
[09:39:05] <wikibugs>	 (PS2) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:39:20] <wikibugs>	 (PS1) Ladsgroup: admin: Revoke my ssh key temporarily [puppet] - https://gerrit.wikimedia.org/r/837079
[09:39:41] <wikibugs>	 (CR) Ladsgroup: [C: -2] "not yet" [puppet] - https://gerrit.wikimedia.org/r/837079 (owner: Ladsgroup)
[09:39:59] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: neutron: introduce workaround for debian bug #989162 [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824)
[09:40:09] <wikibugs>	 (PS3) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:42:11] <moritzm>	 !log installing Linux 5.10.140 updates on Bullseye hosts (released via 11.5 point release), just rollout of the package, no reboots involved
[09:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:48] <wikibugs>	 (PS4) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:45:43] <wikibugs>	 (PS5) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[09:46:55] <wikibugs>	 (CR) Clément Goubert: "Thanks!" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[09:48:48] <wikibugs>	 (CR) Clément Goubert: [C: +1] "LGTM" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[09:49:41] <wikibugs>	 (CR) Btullis: [C: +2] Remove duplicate YAML hash from releases hieradata [puppet] - https://gerrit.wikimedia.org/r/830569 (owner: Btullis)
[09:53:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but I didn't check if the CREATE TABLE instruction would succeed or not." [puppet] - https://gerrit.wikimedia.org/r/836849 (https://phabricator.wikimedia.org/T318047) (owner: David Caro)
[09:54:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35219 and previous config saved to /var/cache/conftool/dbconfig/20220930-095423-ladsgroup.json
[09:54:28] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:56:48] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "Did you consider hosting the deployment information + configuration values in the same repo as the source code? And then have a ./deploy.s" [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925) (owner: Majavah)
[09:57:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Rename labs and cloud filters [homer/public] - https://gerrit.wikimedia.org/r/767476 (owner: Ayounsi)
[09:58:46] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "perhaps this is no longer necessary?" [puppet] - https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: Jbond)
[10:07:51] <wikibugs>	 (CR) Hashar: "You can go ahead and CR+2 / V+2 and submit the change, I don't have permissions on this repo ;)" [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[10:09:11] <wikibugs>	 (CR) Hnowlan: "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: Jbond)
[10:09:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P35220 and previous config saved to /var/cache/conftool/dbconfig/20220930-100930-ladsgroup.json
[10:11:53] <wikibugs>	 (Abandoned) Majavah: toolforge: provision delete-crashing-pods values [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925) (owner: Majavah)
[10:16:26] <wikibugs>	 (CR) David Caro: [C: +2] maintain-dbusers: add missing collate to the account table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/836849 (https://phabricator.wikimedia.org/T318047) (owner: David Caro)
[10:17:57] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:20:13] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:24:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P35221 and previous config saved to /var/cache/conftool/dbconfig/20220930-102436-ladsgroup.json
[10:27:29] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM, I will discuss it within infra foundations however, in case this is something we wish to do across all systems or not." [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:28:18] <wikibugs>	 (CR) Hashar: doc: add README.md (4 comments) [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[10:28:23] <wikibugs>	 (PS6) Hashar: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[10:28:43] <wikibugs>	 (PS1) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[10:30:03] <wikibugs>	 (PS2) Jcrespo: mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062)
[10:35:46] <wikibugs>	 (CR) Jcrespo: "Context: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/mariadb/dbstore_multiinsta" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[10:39:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T314041)', diff saved to https://phabricator.wikimedia.org/P35222 and previous config saved to /var/cache/conftool/dbconfig/20220930-103943-ladsgroup.json
[10:39:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[10:39:47] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:39:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[10:40:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35223 and previous config saved to /var/cache/conftool/dbconfig/20220930-104004-ladsgroup.json
[10:43:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: neutron: introduce workaround for debian bug #989162 [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:43:51] <wikibugs>	 (CR) Muehlenhoff: "Why not simply rebuild the bridge-utils deb?" [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:44:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37397/" [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:45:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: neutron: introduce workaround for debian bug #989162 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[10:46:55] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (MoritzMuehlenhoff) Traffic folks, can be please go ahead and fully decom cp5001, then? Right now this is in a weird limbo state between debmonitor/puppetdb/Netbox.
[10:47:06] <wikibugs>	 (CR) Hnowlan: [C: +1] "lgtm, nice!" [cookbooks] - https://gerrit.wikimedia.org/r/836790 (owner: Muehlenhoff)
[10:53:00] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Sorry yes. Dell is shipping out another memory stick waiting on part right now
[11:00:24] <wikibugs>	 (CR) Muehlenhoff: openstack: neutron: introduce workaround for debian bug #989162 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837078 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[11:00:31] <wikibugs>	 (CR) Clément Goubert: [V: +2 C: +2] Add .gitreview [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/837074 (owner: Hashar)
[11:01:44] <wikibugs>	 (CR) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[11:01:56] <wikibugs>	 (CR) FNegri: [C: +2] ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[11:06:31] <wikibugs>	 (Merged) jenkins-bot: ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[11:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:09:34] <wikibugs>	 (PS7) Clément Goubert: doc: add README.md [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816
[11:11:46] <wikibugs>	 (CR) Clément Goubert: doc: add README.md (1 comment) [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[11:13:28] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[11:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST jobs) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:15:03] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Great thank you. The host is off, so please feel free to replace it whenever you like.
[11:15:57] <icinga-wm>	 PROBLEM - Disk space on ganeti6002 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti6002&var-datasource=drmrs+prometheus/ops
[11:16:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host puppetdb-test2001.codfw.wmnet
[11:16:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:21:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:21:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache puppetdb-test2001.codfw.wmnet on all recursors
[11:21:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetdb-test2001.codfw.wmnet on all recursors
[11:23:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P35224 and previous config saved to /var/cache/conftool/dbconfig/20220930-112307-root.json
[11:25:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Lucas_Werkmeister_WMDE)
[11:25:53] <nemo-yiannis>	 Hi, I am getting some failures from parsoid on deployment-prep that affects restbase tests. Here is the ticket: https://phabricator.wikimedia.org/T319009 What would be the right channel to reach out to?
[11:25:58] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[11:27:03] <Lucas_WMDE>	 nemo-yiannis: I would try #wikimedia-releng (wikibugs is already posting updates to the task there due to the relevant tags)
[11:27:15] <nemo-yiannis>	 Thanks Lucas_WMDE 
[11:29:45] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more sysctl fine-tuning [puppet] - https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824)
[11:31:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35225 and previous config saved to /var/cache/conftool/dbconfig/20220930-113101-root.json
[11:37:13] <icinga-wm>	 RECOVERY - Disk space on ganeti6002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti6002&var-datasource=drmrs+prometheus/ops
[11:41:46] <wikibugs>	 (PS2) ArielGlenn: snapshot: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: Ladsgroup)
[11:42:33] <wikibugs>	 (CR) ArielGlenn: [C: +2] snapshot: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: Ladsgroup)
[11:43:41] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!  I think we may not need to disable the rp_filter on the physical, but it won't make a difference, as with the default route facing " [puppet] - https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[11:44:09] <wikibugs>	 (PS2) Muehlenhoff: Add monitoring for mirrors [puppet] - https://gerrit.wikimedia.org/r/836775
[11:45:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: neutron: l3_agent: more sysctl fine-tuning [puppet] - https://gerrit.wikimedia.org/r/837088 (https://phabricator.wikimedia.org/T318824) (owner: Arturo Borrero Gonzalez)
[11:45:46] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/836775 (owner: Muehlenhoff)
[11:46:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35226 and previous config saved to /var/cache/conftool/dbconfig/20220930-114605-root.json
[11:46:25] <wikibugs>	 (PS1) ArielGlenn: tiny whitespace fix in sql/xml dumps tables list [puppet] - https://gerrit.wikimedia.org/r/837089
[11:51:41] <wikibugs>	 (CR) Hokwelum: [C: +1] tiny whitespace fix in sql/xml dumps tables list [puppet] - https://gerrit.wikimedia.org/r/837089 (owner: ArielGlenn)
[11:52:12] <wikibugs>	 (CR) ArielGlenn: [C: +2] tiny whitespace fix in sql/xml dumps tables list [puppet] - https://gerrit.wikimedia.org/r/837089 (owner: ArielGlenn)
[11:54:21] <wikibugs>	 (CR) Hnowlan: thumbor: new service chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:55:56] <wikibugs>	 (PS4) ArielGlenn: remove php7.2 from the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) (owner: Hokwelum)
[11:57:28] <wikibugs>	 (CR) ArielGlenn: [C: +2] remove php7.2 from the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/836751 (https://phabricator.wikimedia.org/T318894) (owner: Hokwelum)
[11:59:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetdb-test2001.codfw.wmnet
[12:01:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35227 and previous config saved to /var/cache/conftool/dbconfig/20220930-120113-root.json
[12:07:19] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[12:08:12] <wikibugs>	 (PS1) Muehlenhoff: mirrors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013)
[12:08:14] <wikibugs>	 (PS1) Muehlenhoff: ldap: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013)
[12:08:16] <wikibugs>	 (PS1) Muehlenhoff: docker_registry/imagecatalog: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013)
[12:08:18] <wikibugs>	 (PS1) Muehlenhoff: tlsproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013)
[12:08:20] <wikibugs>	 (PS1) Muehlenhoff: alerts: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013)
[12:08:22] <wikibugs>	 (PS1) Muehlenhoff: dns: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013)
[12:09:35] <wikibugs>	 (PS1) Muehlenhoff: Add DHCP entry for puppetdb-test2001 [puppet] - https://gerrit.wikimedia.org/r/837099 (https://phabricator.wikimedia.org/T318931)
[12:09:48] <wikibugs>	 (PS2) Muehlenhoff: mirrors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013)
[12:10:09] <wikibugs>	 (PS2) Muehlenhoff: ldap: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013)
[12:10:23] <wikibugs>	 (PS2) Muehlenhoff: docker_registry/imagecatalog: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013)
[12:11:36] <wikibugs>	 (CR) CI reject: [V: -1] ldap: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:12:41] <wikibugs>	 (CR) CI reject: [V: -1] docker_registry/imagecatalog: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:16:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35228 and previous config saved to /var/cache/conftool/dbconfig/20220930-121618-root.json
[12:16:25] <wikibugs>	 (PS3) Muehlenhoff: ldap profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013)
[12:16:42] <wikibugs>	 (PS3) Muehlenhoff: docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013)
[12:17:03] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add DHCP entry for puppetdb-test2001 [puppet] - https://gerrit.wikimedia.org/r/837099 (https://phabricator.wikimedia.org/T318931) (owner: Muehlenhoff)
[12:23:46] <wikibugs>	 (PS1) Muehlenhoff: Enable base::service_auto_restart for FPM on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/837101 (https://phabricator.wikimedia.org/T135991)
[12:25:13] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:26:11] <wikibugs>	 (PS2) Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845)
[12:29:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ldap profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:29:14] <wikibugs>	 (PS4) Muehlenhoff: ldap profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837094 (https://phabricator.wikimedia.org/T308013)
[12:31:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35229 and previous config saved to /var/cache/conftool/dbconfig/20220930-123123-root.json
[12:32:03] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:35:18] <wikibugs>	 (PS4) Muehlenhoff: docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013)
[12:37:57] <wikibugs>	 (PS4) BBlack: cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244)
[12:39:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] docker_registry/imagecatalog profiles: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837095 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:40:33] <icinga-wm>	 PROBLEM - Check systemd state on elastic1096 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:44:17] <wikibugs>	 (PS2) Muehlenhoff: snapshot: Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/837101
[12:46:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35230 and previous config saved to /var/cache/conftool/dbconfig/20220930-124628-root.json
[12:47:56] <wikibugs>	 (PS1) Muehlenhoff: Use correct auto restart define [puppet] - https://gerrit.wikimedia.org/r/837104
[12:48:11] <wikibugs>	 (PS1) Esanders: Enable DiscussionTools mobile on enwiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/837105 (https://phabricator.wikimedia.org/T317467)
[12:49:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Use correct auto restart define [puppet] - https://gerrit.wikimedia.org/r/837104 (owner: Muehlenhoff)
[12:50:15] <wikibugs>	 (PS5) BBlack: cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244)
[12:50:16] <wikibugs>	 (PS1) BBlack: Remove cp4021 + cp4027 p11n [puppet] - https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963)
[12:51:20] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Use correct auto restart define [puppet] - https://gerrit.wikimedia.org/r/837104 (owner: Muehlenhoff)
[12:51:29] <wikibugs>	 (CR) CI reject: [V: -1] Remove cp4021 + cp4027 p11n [puppet] - https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) (owner: BBlack)
[12:53:01] <wikibugs>	 (PS2) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[12:53:03] <wikibugs>	 (CR) BBlack: cache node disk layout p11n for F4 config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: BBlack)
[12:54:20] <wikibugs>	 (CR) BBlack: [C: +2] cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: BBlack)
[12:55:44] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[12:56:23] <wikibugs>	 (PS2) BBlack: Remove cp4021 + cp4027 p11n [puppet] - https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963)
[12:57:30] <wikibugs>	 (CR) BBlack: [C: +2] Remove cp4021 + cp4027 p11n [puppet] - https://gerrit.wikimedia.org/r/837106 (https://phabricator.wikimedia.org/T318963) (owner: BBlack)
[13:01:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35231 and previous config saved to /var/cache/conftool/dbconfig/20220930-130133-root.json
[13:02:59] <wikibugs>	 SRE, Traffic, decommission-hardware, Patch-For-Review: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (BBlack)
[13:03:19] <wikibugs>	 SRE, Traffic, decommission-hardware, Patch-For-Review: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (BBlack) a:BBlack→RobH >>! In T318963#8274300, @RobH wrote: > Brandon, >  > Both of these hosts have had the decom script run, but they still have references in...
[13:05:36] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (Jclark-ctr) @ayounsi  is there a time window you prefer?   I can be available  1pm UTC time I am available any day.
[13:06:01] <icinga-wm>	 RECOVERY - Check systemd state on elastic1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:35] <wikibugs>	 SRE, Observability-Logging, Observability-Metrics, serviceops, and 2 others: Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (CDanis) Just pinging this task as OKR season is upon us and this might be a useful and fun thing to sneak...
[13:12:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] k8s::apiserver: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837071 (owner: Muehlenhoff)
[13:13:19] <wikibugs>	 (PS1) Muehlenhoff: Add puppetdb-test2001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931)
[13:13:35] <wikibugs>	 (PS16) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:13:37] <wikibugs>	 (PS3) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[13:14:23] <wikibugs>	 (PS2) Muehlenhoff: Add puppetdb-test2001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931)
[13:14:35] <wikibugs>	 (CR) David Caro: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:14:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:14:50] <wikibugs>	 (CR) Elukey: [C: +2] coredns: add rewrite actions to the config map [deployment-charts] - https://gerrit.wikimedia.org/r/836811 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[13:15:43] <wikibugs>	 (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:16:00] <wikibugs>	 (PS17) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:16:02] <wikibugs>	 (PS4) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[13:16:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35232 and previous config saved to /var/cache/conftool/dbconfig/20220930-131638-root.json
[13:17:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add puppetdb-test2001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/837110 (https://phabricator.wikimedia.org/T318931) (owner: Muehlenhoff)
[13:18:35] <wikibugs>	 (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:19:05] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:05] <wikibugs>	 (PS1) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723)
[13:19:29] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[13:19:42] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: add custom DNS ttl rewrites for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837073 (https://phabricator.wikimedia.org/T318814) (owner: Elukey)
[13:19:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:20:45] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:22:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:22:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:22:26] <wikibugs>	 (PS18) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[13:22:28] <wikibugs>	 (PS5) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[13:22:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:23:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:23:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:23:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:23:29] <wikibugs>	 (PS1) Kosta Harlan: Remove GEHomepageImpactModuleEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/837114
[13:24:01] <wikibugs>	 (CR) CI reject: [V: -1] ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[13:25:35] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[13:26:14] <wikibugs>	 (PS2) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723)
[13:27:43] <icinga-wm>	 PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:31:23] <wikibugs>	 (PS6) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[13:33:17] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:34:12] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077 (owner: David Caro)
[13:38:08] <wikibugs>	 (CR) FNegri: "I have verified this is now working correctly by re-running the cookbook on a host that was already set up:" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[13:44:20] <wikibugs>	 (PS1) Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946)
[13:44:43] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:17] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[13:47:11] <wikibugs>	 (PS1) JMeybohm: Disable zipkin and tracing for wikikube clusters [deployment-charts] - https://gerrit.wikimedia.org/r/837117 (https://phabricator.wikimedia.org/T318814)
[13:47:13] <wikibugs>	 (PS1) JMeybohm: Enable additional envoy native metrics in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/837118
[13:47:41] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[13:51:53] <moritzm>	 !log installing puppetdb-test2001 T318931
[13:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:58] <stashbot>	 T318931: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931
[13:52:05] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:52:19] <wikibugs>	 (PS2) Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946)
[13:53:55] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37399/console" [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[13:57:12] <wikibugs>	 (PS1) Muehlenhoff: mariadb::stock_heartbeat: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837120
[13:57:33] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:59:08] <wikibugs>	 (PS3) Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946)
[13:59:37] <wikibugs>	 (CR) Hashar: [C: +1] doc: add README.md (1 comment) [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/836816 (owner: Clément Goubert)
[13:59:51] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:00:19] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37400/console" [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[14:01:18] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37401/console" [puppet] - https://gerrit.wikimedia.org/r/837071 (owner: Muehlenhoff)
[14:03:02] <wikibugs>	 (CR) JMeybohm: [C: +1] "I would be chicken and stop puppet on multiple masters before merging this, but lgtm" [puppet] - https://gerrit.wikimedia.org/r/837071 (owner: Muehlenhoff)
[14:03:14] <wikibugs>	 (PS1) Muehlenhoff: openstack::monitor::networktests: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837121
[14:03:16] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +1] k8s::apiserver: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837071 (owner: Muehlenhoff)
[14:08:25] <wikibugs>	 (PS19) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:08:27] <wikibugs>	 (PS7) David Caro: maintain-dbusers: enable CI tests, some refactor and fixes [puppet] - https://gerrit.wikimedia.org/r/837077
[14:08:29] <wikibugs>	 (PS1) David Caro: flake8: Several pep8/flake8 fixes [puppet] - https://gerrit.wikimedia.org/r/837126
[14:09:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:12:52] <wikibugs>	 (CR) Elukey: Disable zipkin and tracing for wikikube clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/837117 (https://phabricator.wikimedia.org/T318814) (owner: JMeybohm)
[14:14:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:15:30] <wikibugs>	 (CR) Elukey: [C: +1] Enable additional envoy native metrics in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/837118 (owner: JMeybohm)
[14:23:47] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (nskaggs) I think dupe of T319025
[14:26:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:26:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:28:49] <wikibugs>	 SRE, vm-requests: codfw: 1 VMs requested for puppetdb-test2001 - https://phabricator.wikimedia.org/T318931 (MoritzMuehlenhoff) Open→Resolved puppetdb-test2001 has been created and installed.
[14:29:49] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:30:25] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM. Likely can be merged even before wmf.4 lands, as we're 100% on true anyway. Thanks for making Growth in IS.php shorter! 😊" [mediawiki-config] - https://gerrit.wikimedia.org/r/837114 (owner: Kosta Harlan)
[14:38:29] <wikibugs>	 (PS1) Andrew Bogott: alerts.downtime_host: add a wildcard to the end of the hostname [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[14:40:00] <wikibugs>	 (CR) David Caro: [C: +1] alerts.downtime_host: add a wildcard to the end of the hostname (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[14:41:00] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[14:43:12] <wikibugs>	 (CR) CI reject: [V: -1] alerts.downtime_host: add a wildcard to the end of the hostname [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[14:43:49] <wikibugs>	 (CR) Nskaggs: alerts.downtime_host: add a wildcard to the end of the hostname (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[14:45:38] <wikibugs>	 (CR) Andrew Bogott: alerts.downtime_host: add a wildcard to the end of the hostname (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[14:49:34] <wikibugs>	 (PS2) Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[14:51:08] <wikibugs>	 (PS3) Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[14:55:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:57:21] <wikibugs>	 (CR) Volans: "FYI inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[14:58:18] <wikibugs>	 (CR) CI reject: [V: -1] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[15:01:30] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[15:02:22] <wikibugs>	 (PS3) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723)
[15:02:36] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:04:12] <icinga-wm>	 RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:50] <wikibugs>	 (CR) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[15:16:31] <wikibugs>	 (CR) FNegri: [C: +2] ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[15:16:35] <wikibugs>	 (PS5) JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943)
[15:17:38] <wikibugs>	 (PS5) JMeybohm: Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943)
[15:19:19] <wikibugs>	 (PS5) Hnowlan: maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246)
[15:20:03] <wikibugs>	 (CR) David Caro: alerts.downtime_host: attempt to match alert hostnames with :<port> (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[15:20:11] <wikibugs>	 (CR) CI reject: [V: -1] maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan)
[15:20:32] <wikibugs>	 (Merged) jenkins-bot: ceph.bootstrap_and_add: fix _wait_for_osds (take 2) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837112 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[15:21:57] <wikibugs>	 (PS6) Hnowlan: maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246)
[15:27:46] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan)
[15:32:40] <wikibugs>	 (PS4) Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946)
[15:34:00] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37402/console" [puppet] - https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: Clément Goubert)
[15:34:23] <wikibugs>	 SRE, Traffic, decommission-hardware: decommission cp4021 &n cp4027 - https://phabricator.wikimedia.org/T318963 (RobH) Open→Resolved
[15:34:25] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[15:37:31] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[15:44:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Q4: esams atlas anchor - https://phabricator.wikimedia.org/T307021 (RobH)
[15:45:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Q4: esams atlas anchor - https://phabricator.wikimedia.org/T307021 (RobH)
[15:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:50:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:03] <wikibugs>	 (CR) David Caro: "For irl chat:" [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[16:05:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:08:37] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (Andrew)
[16:10:26] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Andrew)
[16:15:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:20:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35233 and previous config saved to /var/cache/conftool/dbconfig/20220930-162027-ladsgroup.json
[16:20:32] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[16:20:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:25:13] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:32:05] <wikibugs>	 (CR) BryanDavis: "I would like to know more about how the notification system failed before abandoning the idea of the purge script. See T247517#8211187 for" [puppet] - https://gerrit.wikimedia.org/r/829231 (https://phabricator.wikimedia.org/T247517) (owner: Andrew Bogott)
[16:34:07] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[16:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P35234 and previous config saved to /var/cache/conftool/dbconfig/20220930-163533-ladsgroup.json
[16:50:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P35235 and previous config saved to /var/cache/conftool/dbconfig/20220930-165040-ladsgroup.json
[16:54:22] <logmsgbot>	 !log bblack@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[16:54:30] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[16:54:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (AnnWF)
[17:05:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T314041)', diff saved to https://phabricator.wikimedia.org/P35236 and previous config saved to /var/cache/conftool/dbconfig/20220930-170546-ladsgroup.json
[17:05:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[17:05:52] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:06:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[17:06:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35237 and previous config saved to /var/cache/conftool/dbconfig/20220930-170620-ladsgroup.json
[17:16:44] <wikibugs>	 (CR) Samtar: "I'll be the first to admit that I'm not only unsure if this is needed, but I don't fully understand what it does — I'll mark this for revi" [puppet] - https://gerrit.wikimedia.org/r/837107 (https://phabricator.wikimedia.org/T317417) (owner: Samtar)
[17:17:16] <TheresNoTime>	 ^ "It's only puppet, what could go wrong?" :D
[17:18:19] <Lucas_WMDE>	 Friday evening, the best time for random puppet changes
[17:20:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (greg)
[17:24:51] <logmsgbot>	 !log bblack@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cp4045.ulsfo.wmnet with OS bullseye
[17:24:55] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**)   - Removed f...
[17:25:00] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) cp4045 firmware inventory: bios is newest  1.6.5  10G nic is  22.00.07.60 , downgrading to 21.85.21.92 idrac is  5.10.30.00, cap at this and won't upgrade to 6.x which breaks http...
[17:26:16] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[17:28:36] <wikibugs>	 (PS4) Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[17:29:01] <wikibugs>	 SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319057 (Damilare)
[17:29:14] <wikibugs>	 SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (Damilare)
[17:29:19] <wikibugs>	 (CR) Wctaiwan: [C: +1] "Translations look good." [puppet] - https://gerrit.wikimedia.org/r/816161 (owner: Diskdance)
[17:32:12] <wikibugs>	 (CR) CI reject: [V: -1] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[17:33:23] <wikibugs>	 (PS5) Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[17:37:33] <wikibugs>	 (CR) CI reject: [V: -1] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[17:43:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[17:43:50] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[18:01:20] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye
[18:01:24] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**)   - Removed fro...
[18:08:54] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[18:08:58] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[18:19:46] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: add hbs330 support to installer - https://phabricator.wikimedia.org/T319067 (RobH)
[18:22:43] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (RobH)
[18:22:50] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (RobH) {F35541613}  The last time I had an issue with driver support in the installer, I recall @MoritzMuehlenhoff being the person to help me out.  Moritz is this still the case, and are...
[18:23:26] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (RobH)
[18:23:38] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (RobH) a:RobH→MoritzMuehlenhoff
[18:30:15] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS bullseye
[18:30:19] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye executed with errors: - cp4045 (**FAIL**)   - Removed fro...
[18:35:01] <wikibugs>	 (PS6) Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132
[18:35:07] <wikibugs>	 (CR) CI reject: [V: -1] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[18:48:12] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops: Undeploy patch to use old PHP serialization in PHP 7.4 - https://phabricator.wikimedia.org/T318918 (Jdforrester-WMF)
[19:03:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (AnnWF)
[19:21:01] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (Aklapper) [Please don't copy some existing task. Please use the proper template and make sure the template is linked from a potential team onboarding doc. Thank you!]
[19:31:36] <wikibugs>	 (PS1) Ebernhardson: Update elasticsearch memory pressure alerts [alerts] - https://gerrit.wikimedia.org/r/837180
[19:33:36] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Set binlog format for dbstore mariadb databases to ROW [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[19:34:37] <wikibugs>	 (CR) Jcrespo: "This didn't work,import failed again :-(" [puppet] - https://gerrit.wikimedia.org/r/837083 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[19:37:54] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) I did a little digging from the `install_console` shell on this host.  lspci output for this adapter is: ` ~ # lspci -v -s 65:00.0 -nn 65:00.0 Ser...
[19:52:23] <wikibugs>	 (PS1) Jdlrobson: Fix page toolbar border [skins/Vector] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/836993 (https://phabricator.wikimedia.org/T318952)
[20:24:02] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, although I'm not familiar with the scripts it's trivial enough." [puppet] - https://gerrit.wikimedia.org/r/837126 (owner: David Caro)
[20:25:13] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:30:46] <wikibugs>	 (CR) Volans: "reply inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[20:32:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:37:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:38:10] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[20:54:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet
[20:55:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (Damilare)
[20:56:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (Damilare)
[20:57:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (Damilare)
[20:59:10] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) Update:  cp4037 is racked, but I had to steal its optic for T280202, since its cp4021 was busted anyhow. cp4045 is racked and accessible, but we've run into an installer issue on its insta...
[21:02:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2001.codfw.wmnet
[21:11:16] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (Peachey88)
[21:17:22] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:43:30] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:07:13] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Was just notified by data center of delivery from dell.
[22:18:32] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35240 and previous config saved to /var/cache/conftool/dbconfig/20220930-224027-ladsgroup.json
[22:40:32] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:43:40] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P35241 and previous config saved to /var/cache/conftool/dbconfig/20220930-225534-ladsgroup.json
[23:10:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P35242 and previous config saved to /var/cache/conftool/dbconfig/20220930-231040-ladsgroup.json
[23:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T314041)', diff saved to https://phabricator.wikimedia.org/P35243 and previous config saved to /var/cache/conftool/dbconfig/20220930-232546-ladsgroup.json
[23:25:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:25:51] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:26:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:37:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:44:52] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:46:00] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook