[00:02:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36151 and previous config saved to /var/cache/conftool/dbconfig/20221025-000257-ladsgroup.json
[00:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321312)', diff saved to https://phabricator.wikimedia.org/P36152 and previous config saved to /var/cache/conftool/dbconfig/20221025-000904-ladsgroup.json
[00:09:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:09:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:09:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36153 and previous config saved to /var/cache/conftool/dbconfig/20221025-000931-ladsgroup.json
[00:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36154 and previous config saved to /var/cache/conftool/dbconfig/20221025-001057-ladsgroup.json
[00:11:33] <wikibugs>	 (PS1) Andrew Bogott: C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - https://gerrit.wikimedia.org/r/848557
[00:17:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36155 and previous config saved to /var/cache/conftool/dbconfig/20221025-001705-ladsgroup.json
[00:18:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P36156 and previous config saved to /var/cache/conftool/dbconfig/20221025-001804-ladsgroup.json
[00:18:59] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:30:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36157 and previous config saved to /var/cache/conftool/dbconfig/20221025-003211-ladsgroup.json
[00:33:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P36158 and previous config saved to /var/cache/conftool/dbconfig/20221025-003310-ladsgroup.json
[00:43:26] <wikibugs>	 (CR) Jdlrobson: [C: +1] "🤩🤩🤩" [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[00:45:30] <wikibugs>	 (CR) Jdlrobson: [C: +1] Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[00:47:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36159 and previous config saved to /var/cache/conftool/dbconfig/20221025-004718-ladsgroup.json
[00:48:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36160 and previous config saved to /var/cache/conftool/dbconfig/20221025-004817-ladsgroup.json
[00:53:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36161 and previous config saved to /var/cache/conftool/dbconfig/20221025-005332-ladsgroup.json
[01:02:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36162 and previous config saved to /var/cache/conftool/dbconfig/20221025-010225-ladsgroup.json
[01:07:46] <wikibugs>	 (PS6) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088)
[01:08:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P36163 and previous config saved to /var/cache/conftool/dbconfig/20221025-010839-ladsgroup.json
[01:09:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36164 and previous config saved to /var/cache/conftool/dbconfig/20221025-010943-ladsgroup.json
[01:09:56] <wikibugs>	 (CR) CI reject: [V: -1] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[01:23:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P36165 and previous config saved to /var/cache/conftool/dbconfig/20221025-012345-ladsgroup.json
[01:24:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P36166 and previous config saved to /var/cache/conftool/dbconfig/20221025-012449-ladsgroup.json
[01:30:30] <wikibugs>	 (PS7) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088)
[01:32:38] <wikibugs>	 (CR) CI reject: [V: -1] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[01:35:44] <wikibugs>	 (PS8) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088)
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36167 and previous config saved to /var/cache/conftool/dbconfig/20221025-013852-ladsgroup.json
[01:38:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[01:39:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[01:39:17] <wikibugs>	 (CR) Xcollazo: "Incorporated review comments. Please re-review." [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[01:39:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36168 and previous config saved to /var/cache/conftool/dbconfig/20221025-013917-ladsgroup.json
[01:39:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P36169 and previous config saved to /var/cache/conftool/dbconfig/20221025-013956-ladsgroup.json
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36170 and previous config saved to /var/cache/conftool/dbconfig/20221025-014536-ladsgroup.json
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36171 and previous config saved to /var/cache/conftool/dbconfig/20221025-015502-ladsgroup.json
[01:55:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[01:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[01:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36172 and previous config saved to /var/cache/conftool/dbconfig/20221025-015528-ladsgroup.json
[01:55:47] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0200)
[02:00:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P36173 and previous config saved to /var/cache/conftool/dbconfig/20221025-020043-ladsgroup.json
[02:01:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36174 and previous config saved to /var/cache/conftool/dbconfig/20221025-020150-ladsgroup.json
[02:03:59] <wikibugs>	 (PS1) Raymond Ndibe: p::toolforge:harbor::prepare: upgrade harbor to v2.5.4 [puppet] - https://gerrit.wikimedia.org/r/848602 (https://phabricator.wikimedia.org/T316530)
[02:04:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:05:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:05:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:07:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512)
[02:07:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P36175 and previous config saved to /var/cache/conftool/dbconfig/20221025-021550-ladsgroup.json
[02:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P36176 and previous config saved to /var/cache/conftool/dbconfig/20221025-021656-ladsgroup.json
[02:20:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:21:03] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:24:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[02:24:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:25:11] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:30:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36177 and previous config saved to /var/cache/conftool/dbconfig/20221025-023056-ladsgroup.json
[02:31:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[02:31:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[02:31:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36178 and previous config saved to /var/cache/conftool/dbconfig/20221025-023120-ladsgroup.json
[02:31:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P36179 and previous config saved to /var/cache/conftool/dbconfig/20221025-023203-ladsgroup.json
[02:32:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:32:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:32:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:37:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36180 and previous config saved to /var/cache/conftool/dbconfig/20221025-023733-ladsgroup.json
[02:40:09] <wikibugs>	 (CR) Cwhite: [C: +1] dispatch: update to latest upstream [docker-images/production-images] - https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[02:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36181 and previous config saved to /var/cache/conftool/dbconfig/20221025-024709-ladsgroup.json
[02:47:15] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:49:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:50:55] <wikibugs>	 (CR) Cwhite: miscweb: add rsyslog::input::files to send apache logs to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[02:52:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P36182 and previous config saved to /var/cache/conftool/dbconfig/20221025-025239-ladsgroup.json
[02:56:49] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0300)
[03:00:45] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:12] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 (https://phabricator.wikimedia.org/T320512)
[03:01:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[03:01:56] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[03:02:24] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.7  refs T320512
[03:02:29] <stashbot>	 T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512
[03:03:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:03:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:03:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:04:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:07:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P36183 and previous config saved to /var/cache/conftool/dbconfig/20221025-030745-ladsgroup.json
[03:09:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:10:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:10:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:11:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:17:17] <wikibugs>	 (PS1) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:17:51] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:19:34] <wikibugs>	 (PS2) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:20:08] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:20:39] <icinga-wm>	 PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2022-10-25 03:08:31 is 899 MiB, but the previous one was 1.2 GiB, a change of -25.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:20:57] <wikibugs>	 (PS3) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:21:31] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:22:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36184 and previous config saved to /var/cache/conftool/dbconfig/20221025-032252-ladsgroup.json
[03:22:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[03:23:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[03:23:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36185 and previous config saved to /var/cache/conftool/dbconfig/20221025-032316-ladsgroup.json
[03:29:02] <wikibugs>	 (PS4) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:29:36] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:30:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36186 and previous config saved to /var/cache/conftool/dbconfig/20221025-033039-ladsgroup.json
[03:34:18] <wikibugs>	 (PS5) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:34:52] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:38:18] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.7  refs T320512 (duration: 35m 54s)
[03:38:23] <stashbot>	 T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512
[03:39:19] <wikibugs>	 (PS6) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:39:53] <wikibugs>	 (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:40:16] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.5 (duration: 01m 56s)
[03:41:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:42:24] <wikibugs>	 (PS7) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514)
[03:45:09] <wikibugs>	 (CR) Andrew Bogott: "I didn't have much luck enumerating all the mgrs but this seems to work for the one that counts ($hostname)" [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott)
[03:45:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P36187 and previous config saved to /var/cache/conftool/dbconfig/20221025-034546-ladsgroup.json
[03:48:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:48:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:54:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:57:15] <icinga-wm>	 PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:59:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:00:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P36188 and previous config saved to /var/cache/conftool/dbconfig/20221025-040052-ladsgroup.json
[04:01:19] <icinga-wm>	 RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:03:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:03:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:04:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:15:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36189 and previous config saved to /var/cache/conftool/dbconfig/20221025-041558-ladsgroup.json
[04:18:01] <icinga-wm>	 PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:32:13] <icinga-wm>	 RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:12:55] <icinga-wm>	 PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:18:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T321177
[05:19:02] <stashbot>	 T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177
[05:19:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T321177
[05:19:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1138 with weight 0 T321177', diff saved to https://phabricator.wikimedia.org/P36190 and previous config saved to /var/cache/conftool/dbconfig/20221025-051933-ladsgroup.json
[05:24:21] <icinga-wm>	 PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:43] <_joe_>	 !log restarting pybal on lvs1020 to test cookbook mechanism
[05:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:44:17] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: an-worker1084, analytics1074, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, relforge1003, relforge1004, stat1005, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_ru
[05:44:17] <icinga-wm>	 s
[05:47:17] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot)
[05:47:23] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot)
[05:47:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:48:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:56:18] <_joe_>	 !log restarting pybal again on lvs1020, again for testing
[05:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0600).
[06:00:10] <Amir1>	 o/
[06:00:14] <Amir1>	 let's go
[06:00:17] <marostegui>	 o/
[06:00:33] <Amir1>	 !log Starting s4 eqiad failover from db1160 to db1138 - T321177
[06:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:38] <stashbot>	 T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177
[06:00:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T321177', diff saved to https://phabricator.wikimedia.org/P36191 and previous config saved to /var/cache/conftool/dbconfig/20221025-060043-ladsgroup.json
[06:01:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T321177', diff saved to https://phabricator.wikimedia.org/P36192 and previous config saved to /var/cache/conftool/dbconfig/20221025-060118-ladsgroup.json
[06:02:44] <Amir1>	 it should be mostly done
[06:04:25] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot)
[06:05:06] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s4-master alias (1 comment) [dns] - https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot)
[06:06:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1160 T321177', diff saved to https://phabricator.wikimedia.org/P36193 and previous config saved to /var/cache/conftool/dbconfig/20221025-060643-ladsgroup.json
[06:06:49] <stashbot>	 T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177
[06:09:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:09:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:13:51] <icinga-wm>	 RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:14:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:14:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:15:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[06:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[06:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36194 and previous config saved to /var/cache/conftool/dbconfig/20221025-061552-ladsgroup.json
[06:16:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[06:16:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[06:16:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36195 and previous config saved to /var/cache/conftool/dbconfig/20221025-061621-ladsgroup.json
[06:17:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949
[06:17:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36196 and previous config saved to /var/cache/conftool/dbconfig/20221025-061710-ladsgroup.json
[06:20:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto)
[06:20:39] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: an-worker1084, analytics1074, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, relforge1003, relforge1004, stat1005, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_ru
[06:20:39] <icinga-wm>	 s
[06:23:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36198 and previous config saved to /var/cache/conftool/dbconfig/20221025-062318-ladsgroup.json
[06:23:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36199 and previous config saved to /var/cache/conftool/dbconfig/20221025-062337-ladsgroup.json
[06:25:17] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:32:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Removed kerberos principal for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/849007
[06:32:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 799 hosts
[06:33:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 799 hosts
[06:33:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Removed kerberos principal for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/849007 (owner: 10Muehlenhoff)
[06:33:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 1206 hosts
[06:34:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 1206 hosts
[06:36:57] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 7795
[06:38:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36200 and previous config saved to /var/cache/conftool/dbconfig/20221025-063824-ladsgroup.json
[06:38:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7795
[06:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36201 and previous config saved to /var/cache/conftool/dbconfig/20221025-063843-ladsgroup.json
[06:42:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti4005 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/849010 (https://phabricator.wikimedia.org/T317247)
[06:49:13] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949
[06:50:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: further improvements for logging. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848241 (https://phabricator.wikimedia.org/T301757) (owner: 10Giuseppe Lavagetto)
[06:51:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti4005 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/849010 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[06:52:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: update Istio settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/848344 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[06:52:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto)
[06:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36202 and previous config saved to /var/cache/conftool/dbconfig/20221025-065330-ladsgroup.json
[06:53:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36203 and previous config saved to /var/cache/conftool/dbconfig/20221025-065350-ladsgroup.json
[06:55:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[06:56:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[06:56:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[06:56:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[06:57:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[06:58:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[06:58:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[06:58:48] <wikibugs>	 (03PS1) 10Marostegui: db1202: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/849012
[06:59:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:59:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1202: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/849012 (owner: 10Marostegui)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36204 and previous config saved to /var/cache/conftool/dbconfig/20221025-070004-root.json
[07:00:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Marostegui) I am repooling this host now.
[07:04:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: update to latest upstream [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[07:04:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] dispatch: update to latest upstream [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[07:05:22] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36205 and previous config saved to /var/cache/conftool/dbconfig/20221025-070837-ladsgroup.json
[07:08:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36206 and previous config saved to /var/cache/conftool/dbconfig/20221025-070856-ladsgroup.json
[07:09:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[07:09:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[07:09:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36207 and previous config saved to /var/cache/conftool/dbconfig/20221025-070922-ladsgroup.json
[07:09:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add sanitize filter [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite)
[07:10:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36208 and previous config saved to /var/cache/conftool/dbconfig/20221025-071053-ladsgroup.json
[07:15:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36209 and previous config saved to /var/cache/conftool/dbconfig/20221025-071509-root.json
[07:15:32] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37710/console" [puppet] - 10https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: 10Andrew Bogott)
[07:16:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36210 and previous config saved to /var/cache/conftool/dbconfig/20221025-071652-ladsgroup.json
[07:17:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
[07:26:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P36211 and previous config saved to /var/cache/conftool/dbconfig/20221025-072600-ladsgroup.json
[07:26:41] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] profile::ceph::mon: explicitly create mgr keyring dir [puppet] - 10https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: 10Andrew Bogott)
[07:27:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
[07:30:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36212 and previous config saved to /var/cache/conftool/dbconfig/20221025-073014-root.json
[07:31:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1
[07:31:51] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1
[07:31:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36213 and previous config saved to /var/cache/conftool/dbconfig/20221025-073159-ladsgroup.json
[07:38:21] <moritzm>	 !log installing 5.10.149-2 update on bullseye hosts (regression doesn't concern any of our servers, but still makes sense to have further reboots move to the latest kernel)
[07:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P36215 and previous config saved to /var/cache/conftool/dbconfig/20221025-074106-ladsgroup.json
[07:44:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:swift::storage: drop unused udev rule [puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: 10Jbond)
[07:45:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36216 and previous config saved to /var/cache/conftool/dbconfig/20221025-074519-root.json
[07:47:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36217 and previous config saved to /var/cache/conftool/dbconfig/20221025-074705-ladsgroup.json
[07:48:22] <wikibugs>	 (03PS1) 10Jbond: C:swift::storage: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/849013 (https://phabricator.wikimedia.org/T308677)
[07:51:32] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:53] <wikibugs>	 (03CR) 10Muehlenhoff: "I'd say we should simply create sub team-specific roles? Such as role::insetup::infrastructure_foundations, role::insetup::data_persistenc" [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond)
[07:56:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36218 and previous config saved to /var/cache/conftool/dbconfig/20221025-075613-ladsgroup.json
[07:56:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:56:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36219 and previous config saved to /var/cache/conftool/dbconfig/20221025-075638-ladsgroup.json
[07:56:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36220 and previous config saved to /var/cache/conftool/dbconfig/20221025-075657-ladsgroup.json
[07:59:00] <wikibugs>	 (03PS1) 10Jbond: P:cumin::master: drop low-traffic from PoP sites [puppet] - 10https://gerrit.wikimedia.org/r/849014
[07:59:18] <wikibugs>	 (03PS1) 10Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159)
[08:00:04] <jouncebot>	 jnuche and hashar: gettimeofday() says it's time for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0800)
[08:00:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37711/console" [puppet] - 10https://gerrit.wikimedia.org/r/849014 (owner: 10Jbond)
[08:00:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36221 and previous config saved to /var/cache/conftool/dbconfig/20221025-080024-root.json
[08:01:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36222 and previous config saved to /var/cache/conftool/dbconfig/20221025-080153-ladsgroup.json
[08:02:09] <moritzm>	 !log drain ganeti1023 for eventual reimage T311687
[08:02:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36223 and previous config saved to /var/cache/conftool/dbconfig/20221025-080212-ladsgroup.json
[08:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:14] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[08:02:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[08:02:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[08:02:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36224 and previous config saved to /var/cache/conftool/dbconfig/20221025-080238-ladsgroup.json
[08:03:15] <wikibugs>	 (03PS2) 10Jbond: P:cumin::master: drop low-traffic from PoP sites [puppet] - 10https://gerrit.wikimedia.org/r/849014
[08:03:43] <wikibugs>	 (03PS2) 10Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159)
[08:04:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37712/console" [puppet] - 10https://gerrit.wikimedia.org/r/849014 (owner: 10Jbond)
[08:07:04] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949
[08:07:08] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512)
[08:07:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot)
[08:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot)
[08:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:10:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36225 and previous config saved to /var/cache/conftool/dbconfig/20221025-081007-ladsgroup.json
[08:10:16] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[08:10:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/849014 (owner: 10Jbond)
[08:10:46] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet
[08:11:42] <wikibugs>	 (03PS3) 10Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159)
[08:12:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:cumin::master: drop low-traffic from PoP sites [puppet] - 10https://gerrit.wikimedia.org/r/849014 (owner: 10Jbond)
[08:12:23] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.7  refs T320512
[08:12:28] <stashbot>	 T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512
[08:12:39] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) >>! In T308677#8339622, @jbond wrote: >> luckily puppet doesn'...
[08:13:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT leases) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:14:14] <wikibugs>	 (03CR) 10Jelto: ""Job succeeded" - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/jobs/27102" [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro)
[08:14:26] <wikibugs>	 (03PS4) 10Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159)
[08:15:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36226 and previous config saved to /var/cache/conftool/dbconfig/20221025-081529-root.json
[08:15:42] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36227 and previous config saved to /var/cache/conftool/dbconfig/20221025-081700-ladsgroup.json
[08:17:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:17:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:17:30] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[08:18:02] <wikibugs>	 (03CR) 10Elukey: "I checked differences with https://github.com/coredns/helm, there are some but mostly related to:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey)
[08:18:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:19:40] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[08:20:46] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet
[08:21:10] <wikibugs>	 (03CR) 10Elukey: "Helm lint currently adds the new option for endpointslices anyway, I am wondering if KubeVersion is not what we expect in CI." [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey)
[08:24:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:25:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36228 and previous config saved to /var/cache/conftool/dbconfig/20221025-082514-ladsgroup.json
[08:26:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1
[08:26:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1
[08:29:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:29:47] <logmsgbot>	 !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1065.eqiad.wmnet
[08:30:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36229 and previous config saved to /var/cache/conftool/dbconfig/20221025-083034-root.json
[08:32:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36230 and previous config saved to /var/cache/conftool/dbconfig/20221025-083206-ladsgroup.json
[08:33:04] <wikibugs>	 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) So far, all of the inbound messages to this mailing list have been held for moderat...
[08:36:45] <wikibugs>	 (03CR) 10Volans: "Nice to see a new cookbook! I've left some comments inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto)
[08:36:49] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet
[08:40:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36232 and previous config saved to /var/cache/conftool/dbconfig/20221025-084020-ladsgroup.json
[08:41:44] <wikibugs>	 (03PS1) 10Jbond: insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020
[08:42:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[08:42:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:insetup: drop role contact I/F (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond)
[08:44:34] <wikibugs>	 (03PS7) 10Filippo Giunchedi: dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[08:44:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229)
[08:45:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[08:45:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[08:45:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:swift::storage: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/849013 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[08:45:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36233 and previous config saved to /var/cache/conftool/dbconfig/20221025-084541-root.json
[08:46:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[08:47:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36234 and previous config saved to /var/cache/conftool/dbconfig/20221025-084713-ladsgroup.json
[08:48:33] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "After merging this you need:" [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[08:48:43] <wikibugs>	 (03PS8) 10Filippo Giunchedi: dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[08:48:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229)
[08:49:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36235 and previous config saved to /var/cache/conftool/dbconfig/20221025-084929-ladsgroup.json
[08:49:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[08:51:22] <wikibugs>	 (03PS5) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143)
[08:54:23] <wikibugs>	 (03PS9) 10Filippo Giunchedi: dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[08:54:25] <wikibugs>	 (03PS3) 10Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229)
[08:55:02] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[08:55:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Swap ganeti4003 with ganeti4005 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247)
[08:55:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36236 and previous config saved to /var/cache/conftool/dbconfig/20221025-085527-ladsgroup.json
[08:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:55:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:55:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36237 and previous config saved to /var/cache/conftool/dbconfig/20221025-085558-ladsgroup.json
[08:57:23] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[08:57:35] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[08:58:42] <wikibugs>	 (03PS1) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024
[09:00:08] <wikibugs>	 (03PS1) 10Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667)
[09:01:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, deploy is only puppet-merge, no further action needed" [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[09:01:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Swap ganeti4003 with ganeti4005 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[09:01:08] <wikibugs>	 (03PS1) 10Jbond: insetup_noferm: add traffic as the owner of this role [puppet] - 10https://gerrit.wikimedia.org/r/849026
[09:02:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36238 and previous config saved to /var/cache/conftool/dbconfig/20221025-090213-ladsgroup.json
[09:02:36] <wikibugs>	 (CR) Muehlenhoff: site.pp: move insetup hosts to the team specific role (4 comments) [puppet] - https://gerrit.wikimedia.org/r/849024 (owner: Jbond)
[09:04:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P36239 and previous config saved to /var/cache/conftool/dbconfig/20221025-090436-ladsgroup.json
[09:06:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+2] "I think its best to override CI on this one.  I think it makes more sense to just use the system::role that comes from role::insetup" [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[09:10:32] <wikibugs>	 (03PS2) 10Jbond: insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020
[09:11:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[09:11:55] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Add production-side filters for CampaignEvents extension tables [puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595)
[09:13:48] <wikibugs>	 (03CR) 10MVernon: "My slight concern with this is that these two jobs currently produce a ping on #wikimedia-data-persistence on IRC; with this change, I thi" [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi)
[09:13:51] <wikibugs>	 (CR) Btullis: analytics: move kerberos::systemd_timer and deps to send_mail param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[09:14:17] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[09:14:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet
[09:15:40] <wikibugs>	 (03CR) 10Jcrespo: "Based on https://phabricator.wikimedia.org/P35370 there are no fully private tables to add to manifests/realm.pp but please check." [puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo)
[09:16:10] <wikibugs>	 (CR) MVernon: Use generic 'Check systemd state' alert to catch timer failures (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[09:17:16] <wikibugs>	 (CR) Giuseppe Lavagetto: Add cookbook to restart pybal (11 comments) [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[09:17:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36240 and previous config saved to /var/cache/conftool/dbconfig/20221025-091720-ladsgroup.json
[09:17:56] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949
[09:19:27] <wikibugs>	 (03PS2) 10Muehlenhoff: Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247)
[09:19:39] <wikibugs>	 (03PS1) 10Volans: CORE_DATACENTERS: use the wmflib constant [cookbooks] - 10https://gerrit.wikimedia.org/r/849031
[09:19:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P36241 and previous config saved to /var/cache/conftool/dbconfig/20221025-091942-ladsgroup.json
[09:21:59] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949
[09:22:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] analytics: move kerberos::systemd_timer and deps to send_mail param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[09:22:49] <wikibugs>	 (03PS1) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514)
[09:23:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[09:23:44] <wikibugs>	 (03PS2) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514)
[09:24:12] <wikibugs>	 (CR) Filippo Giunchedi: Use generic 'Check systemd state' alert to catch timer failures (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[09:25:20] <wikibugs>	 (03PS1) 10FNegri: Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143)
[09:25:37] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37713/console" [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro)
[09:25:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro)
[09:26:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[09:26:38] <wikibugs>	 (03PS3) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514)
[09:26:47] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro)
[09:27:12] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet
[09:27:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 (owner: 10Volans)
[09:28:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[09:28:59] <wikibugs>	 (03PS4) 10David Caro: p::ceph:mon: set permissions if mgr key parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514)
[09:29:05] <wikibugs>	 (03CR) 10Volans: "Replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto)
[09:30:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo)
[09:30:34] <wikibugs>	 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) @Dzahn - would it be acceptable for us to use the exim aliases file to forward the...
[09:32:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36243 and previous config saved to /var/cache/conftool/dbconfig/20221025-093226-ladsgroup.json
[09:32:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36244 and previous config saved to /var/cache/conftool/dbconfig/20221025-093449-ladsgroup.json
[09:34:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[09:35:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[09:35:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36245 and previous config saved to /var/cache/conftool/dbconfig/20221025-093513-ladsgroup.json
[09:36:09] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond)
[09:36:11] <wikibugs>	 (03PS1) 10Volans: tox.ini: explain why there are old Python versions [cookbooks] - 10https://gerrit.wikimedia.org/r/849034 (https://phabricator.wikimedia.org/T289222)
[09:36:26] <wikibugs>	 (03PS2) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024
[09:36:43] <moritzm>	 !log drain ganeti4002 for eventual decom T317247
[09:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:48] <stashbot>	 T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247
[09:41:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36246 and previous config saved to /var/cache/conftool/dbconfig/20221025-094122-ladsgroup.json
[09:44:16] <wikibugs>	 (03CR) 10Muehlenhoff: "Two more, missed them before." [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond)
[09:47:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36247 and previous config saved to /var/cache/conftool/dbconfig/20221025-094733-ladsgroup.json
[09:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:47:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:48:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36248 and previous config saved to /var/cache/conftool/dbconfig/20221025-094800-ladsgroup.json
[09:49:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36249 and previous config saved to /var/cache/conftool/dbconfig/20221025-094921-ladsgroup.json
[09:51:42] <wikibugs>	 (CR) JMeybohm: coredns: support up to upstream version 1.8.7 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[09:51:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:52:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36250 and previous config saved to /var/cache/conftool/dbconfig/20221025-095527-ladsgroup.json
[09:55:31] <wikibugs>	 (03CR) 10Jbond: "thanks done" [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond)
[09:55:46] <wikibugs>	 (03PS3) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024
[09:56:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P36251 and previous config saved to /var/cache/conftool/dbconfig/20221025-095629-ladsgroup.json
[09:57:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey)
[09:57:25] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet
[09:57:35] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond)
[09:58:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/849026 (owner: 10Jbond)
[09:59:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+2] insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[09:59:14] <wikibugs>	 (03PS4) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024
[09:59:21] <wikibugs>	 (03PS2) 10Jbond: insetup_noferm: add traffic as the owner of this role [puppet] - 10https://gerrit.wikimedia.org/r/849026
[10:00:29] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover all m*-master [dns] - 10https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312)
[10:02:46] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add a Istio vs and retry settings on ml-serve for eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/849040 (https://phabricator.wikimedia.org/T320374)
[10:03:37] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:03:37] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[10:03:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849026 (owner: 10Jbond)
[10:05:37] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:06:45] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I don't mind the addition of the new roles, it makes sense to me although a bit verbose. Just make sure that DCOps is onboard with it as t" [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[10:07:12] <wikibugs>	 (CR) Vgutierrez: Add cookbook to restart pybal (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[10:07:16] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[10:07:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Amir, this requires sanitarium puppet runs + mariadb restarts." [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza)
[10:10:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36252 and previous config saved to /var/cache/conftool/dbconfig/20221025-101034-ladsgroup.json
[10:11:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P36253 and previous config saved to /var/cache/conftool/dbconfig/20221025-101135-ladsgroup.json
[10:11:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:39] <wikibugs>	 (CR) Volans: [C: +1] insetup: add team specific insetup roles to ease ownership identification (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond)
[10:22:15] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:25:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36254 and previous config saved to /var/cache/conftool/dbconfig/20221025-102540-ladsgroup.json
[10:26:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond)
[10:26:31] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36255 and previous config saved to /var/cache/conftool/dbconfig/20221025-102642-ladsgroup.json
[10:26:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[10:27:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[10:27:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:27:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] insetup: add team specific insetup roles to ease ownership identification (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond)
[10:27:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:27:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36256 and previous config saved to /var/cache/conftool/dbconfig/20221025-102724-ladsgroup.json
[10:28:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:44] <wikibugs>	 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10Vgutierrez)
[10:29:59] <wikibugs>	 SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (Vgutierrez) p:Triage→Medium
[10:30:29] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:31:35] <wikibugs>	 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10Vgutierrez)
[10:31:45] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:31:48] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet
[10:32:21] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet
[10:32:51] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover all m*-master [dns] - 10https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312) (owner: 10Marostegui)
[10:33:15] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 2.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36257 and previous config saved to /var/cache/conftool/dbconfig/20221025-103346-ladsgroup.json
[10:40:06] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 03+1] "Yup, LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo)
[10:40:11] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:40:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36258 and previous config saved to /var/cache/conftool/dbconfig/20221025-104047-ladsgroup.json
[10:40:53] <wikibugs>	 (03PS1) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[10:41:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet
[10:41:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[10:41:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:32] <wikibugs>	 (03PS2) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[10:43:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36259 and previous config saved to /var/cache/conftool/dbconfig/20221025-104303-ladsgroup.json
[10:43:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[10:43:13] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37714/console" [puppet] - 10https://gerrit.wikimedia.org/r/841908 (owner: 10JMeybohm)
[10:43:30] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - 10https://gerrit.wikimedia.org/r/841908 (owner: 10JMeybohm)
[10:43:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet
[10:48:29] <wikibugs>	 (03PS3) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[10:48:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P36260 and previous config saved to /var/cache/conftool/dbconfig/20221025-104852-ladsgroup.json
[10:49:21] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:06] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37715/console" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[10:51:58] <wikibugs>	 10SRE, 10Observability-Metrics, 10serviceops, 10Maps (Kartotherian), 10Patch-For-Review: Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10hnowlan)
[10:53:32] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet
[10:54:12] <wikibugs>	 (03PS4) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[10:55:11] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37716/console" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri)
[10:58:08] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Zabe)
[10:58:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36261 and previous config saved to /var/cache/conftool/dbconfig/20221025-105810-ladsgroup.json
[11:03:51] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:04:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P36262 and previous config saved to /var/cache/conftool/dbconfig/20221025-110359-ladsgroup.json
[11:11:39] <wikibugs>	 (03PS1) 10Jbond: aptrepo: create a component to backport python3.9 to unblock CI [puppet] - 10https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222)
[11:12:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:13:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36263 and previous config saved to /var/cache/conftool/dbconfig/20221025-111316-ladsgroup.json
[11:16:37] <wikibugs>	 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Ladsgroup) What you did for accepting non-members looks good to me. I haven't seen any held...
[11:19:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36264 and previous config saved to /var/cache/conftool/dbconfig/20221025-111906-ladsgroup.json
[11:19:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[11:19:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[11:19:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36265 and previous config saved to /var/cache/conftool/dbconfig/20221025-111930-ladsgroup.json
[11:19:47] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover all m*-master [dns] - https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312) (owner: Marostegui)
[11:22:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222) (owner: Jbond)
[11:23:26] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] insetup: add team specific insetup roles to ease ownership identification [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond)
[11:23:32] <wikibugs>	 (CR) Jbond: [C: +2] site.pp: move insetup hosts to the team specific role [puppet] - https://gerrit.wikimedia.org/r/849024 (owner: Jbond)
[11:23:36] <wikibugs>	 (CR) Jbond: [C: +2] insetup_noferm: add traffic as the owner of this role [puppet] - https://gerrit.wikimedia.org/r/849026 (owner: Jbond)
[11:24:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: Add new tekton package to WMCS bastions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[11:25:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36266 and previous config saved to /var/cache/conftool/dbconfig/20221025-112527-ladsgroup.json
[11:25:36] <wikibugs>	 (CR) Jbond: [C: +2] aptrepo: create a component to backport python3.9 to unblock CI [puppet] - https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222) (owner: Jbond)
[11:28:17] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: realm.pp: introduce $::wmcs_project [puppet] - https://gerrit.wikimedia.org/r/849050
[11:28:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36267 and previous config saved to /var/cache/conftool/dbconfig/20221025-112822-ladsgroup.json
[11:28:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[11:28:30] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Add production-side filters for CampaignEvents extension tables [puppet] - https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: Jcrespo)
[11:28:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[11:28:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36268 and previous config saved to /var/cache/conftool/dbconfig/20221025-112848-ladsgroup.json
[11:29:17] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37717/console" [puppet] - https://gerrit.wikimedia.org/r/844513 (owner: Dduvall)
[11:33:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org
[11:34:12] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org
[11:34:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36269 and previous config saved to /var/cache/conftool/dbconfig/20221025-113455-ladsgroup.json
[11:35:22] <wikibugs>	 (PS2) Ladsgroup: Add growthexperiments_user_impact to $private_tables [puppet] - https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: Gergő Tisza)
[11:35:25] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Add growthexperiments_user_impact to $private_tables [puppet] - https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: Gergő Tisza)
[11:37:26] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013)
[11:37:37] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org
[11:37:51] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013)
[11:38:08] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2002.wikimedia.org
[11:40:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P36270 and previous config saved to /var/cache/conftool/dbconfig/20221025-114034-ladsgroup.json
[11:41:39] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - https://gerrit.wikimedia.org/r/844513 (owner: Dduvall)
[11:41:52] <icinga-wm>	 PROBLEM - IPMI Sensor Status on kafka-logging1004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:43:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual decom
[11:43:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual decom
[11:46:08] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti4002 from Puppet for decom [puppet] - https://gerrit.wikimedia.org/r/849054 (https://phabricator.wikimedia.org/T317247)
[11:49:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove ganeti4002 from Puppet for decom [puppet] - https://gerrit.wikimedia.org/r/849054 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff)
[11:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36271 and previous config saved to /var/cache/conftool/dbconfig/20221025-115002-ladsgroup.json
[11:54:52] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 132203
[11:55:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4002.ulsfo.wmnet
[11:55:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P36272 and previous config saved to /var/cache/conftool/dbconfig/20221025-115540-ladsgroup.json
[11:57:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 132203
[11:59:57] <wikibugs>	 (CR) David Caro: "we should not be installing tekton-cli directly, if needed, it should be pulled by toolforge-cli, so we will want to configure the reposit" [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[12:00:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:05:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36273 and previous config saved to /var/cache/conftool/dbconfig/20221025-120509-ladsgroup.json
[12:07:00] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] coredns: upgrade to 1.8.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[12:09:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:09:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4002.ulsfo.wmnet
[12:09:28] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti4002.ulsfo.wmnet` - ganeti4002.ulsfo.wmnet (**PASS**)...
[12:10:37] <wikibugs>	 SRE, Wikimedia-Mailing-lists: "The FOO list has N moderation requests waiting." notifications can't be turned off in Mailman 3 - https://phabricator.wikimedia.org/T284107 (GreenReaper) Even if you only get one spam email a day, it basically means you are quite likely to get two emails about it rather tha...
[12:10:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36274 and previous config saved to /var/cache/conftool/dbconfig/20221025-121047-ladsgroup.json
[12:10:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[12:11:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[12:11:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36275 and previous config saved to /var/cache/conftool/dbconfig/20221025-121111-ladsgroup.json
[12:16:31] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: add a Istio vs and retry settings on ml-serve for eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/849040 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[12:16:42] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (MoritzMuehlenhoff) I have setup ganeti4005 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected.  @RobH : I've als...
[12:17:16] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36276 and previous config saved to /var/cache/conftool/dbconfig/20221025-121730-ladsgroup.json
[12:18:24] <Amir1>	 I restarted mailman services, let's see if it fixes
[12:19:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[12:19:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[12:20:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36277 and previous config saved to /var/cache/conftool/dbconfig/20221025-122015-ladsgroup.json
[12:22:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:40] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2519
[12:23:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2519
[12:23:35] <wikibugs>	 (PS1) Muehlenhoff: Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059
[12:26:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[12:26:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[12:27:11] <wikibugs>	 (PS1) Muehlenhoff: Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061
[12:28:20] <wikibugs>	 (PS6) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949
[12:29:04] <wikibugs>	 (PS1) Muehlenhoff: Set profile::contacts::role_contacts for gitlab runners [puppet] - https://gerrit.wikimedia.org/r/849063
[12:29:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance
[12:29:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance
[12:30:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36278 and previous config saved to /var/cache/conftool/dbconfig/20221025-123001-ladsgroup.json
[12:30:40] <wikibugs>	 (PS1) Ladsgroup: Add add_el_to_domain_index_T318605.py [software/schema-changes] - https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605)
[12:31:49] <wikibugs>	 (CR) CI reject: [V: -1] Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[12:32:04] <icinga-wm>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: virgin is 27 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[12:32:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P36279 and previous config saved to /var/cache/conftool/dbconfig/20221025-123236-ladsgroup.json
[12:33:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1023.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[12:33:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1023.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[12:36:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36280 and previous config saved to /var/cache/conftool/dbconfig/20221025-123615-ladsgroup.json
[12:36:44] <wikibugs>	 (PS5) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[12:37:12] <icinga-wm>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[12:37:26] <wikibugs>	 (CR) CI reject: [V: -1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[12:38:00] <wikibugs>	 SRE, Traffic, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (fgiunchedi) I can't reproduce the issue ATM via  https://grafana.wikimedia.org/goto/14wkdONVz?orgId=1 however your intuition is correct: the interval for...
[12:38:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1023.eqiad.wmnet with OS bullseye
[12:38:51] <hashar>	 !log Restarting CI Jenkins
[12:38:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bullseye
[12:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:39] <moritzm>	 !log drain ganeti1015 for eventual reimage T311687
[12:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:47] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[12:42:51] <hashar>	 oh thanks systemd for killing jenkins grr
[12:44:20] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:44:49] <wikibugs>	 (PS1) Filippo Giunchedi: timer::job: remove monitoring_enabled [puppet] - https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253)
[12:45:14] <wikibugs>	 (PS1) Clément Goubert: aptrepo: add component thirdparty/otelcol-contrib [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551)
[12:46:30] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37721/console" [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) (owner: Clément Goubert)
[12:46:50] <wikibugs>	 (PS6) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[12:47:24] <wikibugs>	 (Abandoned) Andrew Bogott: C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - https://gerrit.wikimedia.org/r/848557 (owner: Andrew Bogott)
[12:47:35] <wikibugs>	 (CR) CI reject: [V: -1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[12:47:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P36281 and previous config saved to /var/cache/conftool/dbconfig/20221025-124743-ladsgroup.json
[12:51:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.579 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36283 and previous config saved to /var/cache/conftool/dbconfig/20221025-125122-ladsgroup.json
[12:51:59] <wikibugs>	 (PS7) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[12:52:54] <wikibugs>	 (CR) FNegri: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37724/console" [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[12:52:58] <wikibugs>	 (CR) CI reject: [V: -1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[12:53:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage
[12:53:38] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:55:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage
[12:58:42] <wikibugs>	 SRE, Traffic, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (Vgutierrez) oh got it, thanks @fgiunchedi  @BCornwall please update the min step to 2m in the dashboard.. maybe adding a hidden variable and referencing...
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1300).
[13:00:05] <jouncebot>	 koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1300)
[13:00:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:27] <koi>	 o/
[13:00:57] <wikibugs>	 (PS8) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143)
[13:02:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36284 and previous config saved to /var/cache/conftool/dbconfig/20221025-130249-ladsgroup.json
[13:02:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:03:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36285 and previous config saved to /var/cache/conftool/dbconfig/20221025-130314-ladsgroup.json
[13:06:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36286 and previous config saved to /var/cache/conftool/dbconfig/20221025-130628-ladsgroup.json
[13:07:02] <kostajh>	 hi
[13:07:12] <kostajh>	 I have an addition to the backport window
[13:09:25] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendation for aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/849092 (https://phabricator.wikimedia.org/T304549)
[13:09:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36287 and previous config saved to /var/cache/conftool/dbconfig/20221025-130931-ladsgroup.json
[13:11:04] <wikibugs>	 SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) I've been scratching my head a little on this because the alert seemingly *has* fired:  {F35624931}  {F35624934}  Yet I can't find any notification ATM
[13:11:06] <wikibugs>	 (PS1) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[13:11:11] <wikibugs>	 (PS1) Muehlenhoff: Fix usage example [cookbooks] - https://gerrit.wikimedia.org/r/849094
[13:11:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to drbd
[13:11:18] <kostajh>	 koi: I don't have enough time to backport your change, I'm afraid. Are any of the other deployers around?
[13:12:11] <koi>	 don't know ..
[13:12:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1023.eqiad.wmnet with OS bullseye
[13:13:43] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]]
[13:14:11] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:14:35] <wikibugs_>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:15:16] <wikibugs_>	 (PS1) Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[13:15:49] <Lucas_WMDE>	 o/
[13:16:02] <wikibugs>	 (CR) Jbond: [C: +1] Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059 (owner: Muehlenhoff)
[13:16:07] * Lucas_WMDE looks
[13:16:15] <wikibugs>	 (CR) Jbond: [C: +1] Set profile::contacts::role_contacts for gitlab runners [puppet] - https://gerrit.wikimedia.org/r/849063 (owner: Muehlenhoff)
[13:16:24] <Lucas_WMDE>	 oh, still the big logos change :sweat_sm
[13:16:33] * Lucas_WMDE needs to learn how emojis work in irccloud
[13:16:40] <Lucas_WMDE>	 😅 is what I meant
[13:16:49] <wikibugs>	 (CR) Jbond: [C: +1] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff)
[13:16:54] <wikibugs>	 (CR) Btullis: [C: +1] Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059 (owner: Muehlenhoff)
[13:17:06] <Lucas_WMDE>	 oh wait, no it’s a different one
[13:17:16] <wikibugs>	 (CR) Btullis: [C: +1] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff)
[13:17:19] <wikibugs>	 (CR) Volans: "couple of questions inline" [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:17:22] <wikibugs>	 (CR) CI reject: [V: -1] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[13:17:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059 (owner: Muehlenhoff)
[13:17:31] <wikibugs>	 (CR) Jbond: [C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) (owner: Arturo Borrero Gonzalez)
[13:17:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff)
[13:17:49] <koi>	 yeah the one yesterday was processed 
[13:17:54] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM! Thanks Cwhite!" [puppet] - https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: Cwhite)
[13:17:59] <Lucas_WMDE>	 nice
[13:18:05] <Lucas_WMDE>	 okay, looking
[13:18:14] <wikibugs>	 (CR) Majavah: wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[13:18:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:18:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:18:56] <wikibugs>	 (CR) MVernon: [C: +1] Use generic 'Check systemd state' alert to catch timer failures (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[13:19:13] <wikibugs>	 (CR) Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[13:19:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:19:29] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]] (duration: 05m 45s)
[13:20:09] <wikibugs>	 (PS2) Filippo Giunchedi: Use generic 'Check systemd state' alert to catch timer failures [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253)
[13:20:24] <kostajh>	 Lucas_WMDE: done with backporting my patch.
[13:20:29] <wikibugs>	 (PS2) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[13:20:33] <Lucas_WMDE>	 ok, I’m reviewing koi’s patch
[13:20:37] <wikibugs>	 (CR) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:21:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to drbd
[13:21:17] <icinga-wm>	 PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:21:29] <icinga-wm>	 RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[13:21:35] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:21:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36288 and previous config saved to /var/cache/conftool/dbconfig/20221025-132135-ladsgroup.json
[13:21:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:21:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:22:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36289 and previous config saved to /var/cache/conftool/dbconfig/20221025-132201-ladsgroup.json
[13:22:09] <wikibugs>	 (PS3) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[13:22:50] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:22:54] <Lucas_WMDE>	 koi: just a tiny comment
[13:22:57] <Lucas_WMDE>	 looks good otherwise
[13:23:30] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:23:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[13:24:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[13:24:22] <wikibugs>	 (PS2) Stang: Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620)
[13:24:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P36290 and previous config saved to /var/cache/conftool/dbconfig/20221025-132438-ladsgroup.json
[13:24:50] <wikibugs>	 (CR) Stang: Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:25:03] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:25:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:26:49] <wikibugs>	 (Merged) jenkins-bot: Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[13:27:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]]
[13:27:14] <wikibugs>	 (PS2) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[13:27:19] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[13:27:19] <stashbot>	 T321519: Define wmgSiteLogoVariants in logos/config.yaml - https://phabricator.wikimedia.org/T321519
[13:27:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:27:43] <Lucas_WMDE>	 koi: ^
[13:27:49] <Lucas_WMDE>	 I guess there’s nothing to test – just check nothing’s broken?
[13:28:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:28:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:28:30] <wikibugs>	 (CR) Elukey: coredns: support up to upstream version 1.8.7 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[13:28:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36291 and previous config saved to /var/cache/conftool/dbconfig/20221025-132839-ladsgroup.json
[13:28:54] <koi>	 Lucas_WMDE: I tested all five involved projects, and there's no changes for the logo variants, so LGTM
[13:28:59] <Lucas_WMDE>	 \o/ thanks
[13:29:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:30:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:31:01] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:31:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:31:33] <wikibugs>	 (PS3) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[13:31:42] <wikibugs>	 (CR) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:32:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:32:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[13:33:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]] (duration: 05m 47s)
[13:33:06] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[13:33:07] <stashbot>	 T321519: Define wmgSiteLogoVariants in logos/config.yaml - https://phabricator.wikimedia.org/T321519
[13:33:21] <Lucas_WMDE>	 anything else to deploy?
[13:33:52] <wikibugs>	 (CR) Marostegui: [C: +1] Add add_el_to_domain_index_T318605.py [software/schema-changes] - https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605) (owner: Ladsgroup)
[13:33:54] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[13:34:28] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:34:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to plain
[13:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:00] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[13:35:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix usage example [cookbooks] - https://gerrit.wikimedia.org/r/849094 (owner: Muehlenhoff)
[13:35:05] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [restbase/deploy@5575605]: Update restbase to c1d391c7
[13:35:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to plain
[13:36:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:37:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:37:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:37:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:38:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance
[13:38:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance
[13:38:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Use generic 'Check systemd state' alert to catch timer failures [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[13:39:43] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[13:39:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P36292 and previous config saved to /var/cache/conftool/dbconfig/20221025-133944-ladsgroup.json
[13:40:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:42:11] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:42:17] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:42:21] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:42:40] <marostegui>	 Amir1: ^ 
[13:42:47] <marostegui>	 That comes from db1154 
[13:42:56] <Amir1>	 I'm restarting it
[13:43:04] <Amir1>	 I think I haven't downtimed clouddbs
[13:43:06] <Amir1>	 sorry
[13:43:22] <Amir1>	 it should be up in one or two minutes
[13:43:39] <Amir1>	 marostegui: will it page? 
[13:43:46] <marostegui>	 I think it pages WMCS
[13:43:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36293 and previous config saved to /var/cache/conftool/dbconfig/20221025-134345-ladsgroup.json
[13:43:49] <marostegui>	 But I am not fully sure
[13:44:17] <Amir1>	 ok
[13:45:07] <wikibugs>	 (CR) Herron: dispatch: introduce profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:46:17] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:47:07] <Amir1>	 I have a feeling it's not coming back online, wait for a couple of minutes more and see
[13:47:36] <marostegui>	 I am connected to the console, let me see
[13:50:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.426 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:51:50] <marostegui>	 Amir1: We probably need a task with ops-eqiad
[13:52:03] <marostegui>	 I am trying to send hard resets and poweroff/on but it doesn't seem to be doing anything
[13:52:13] <Amir1>	 okay
[13:52:26] <Amir1>	 let me downtime clouddbs
[13:53:20] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@5575605]: Update restbase to c1d391c7 (duration: 18m 14s)
[13:53:25] <wikibugs>	 SRE, Patch-For-Review, User-jbond: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (ayounsi)  Some notes/thoughts from a chat with @jbond: * Based on P36282 and except `Data Engineering,Machine Learning` all servers have 1 clear team owner * `role_contacts` has been ext...
[13:53:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1020-1021].eqiad.wmnet with reason: db1154 having hw issues
[13:53:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1020-1021].eqiad.wmnet with reason: db1154 having hw issues
[13:54:08] <wikibugs>	 (CR) JMeybohm: [C: +1] coredns: support up to upstream version 1.8.7 [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[13:54:30] <wikibugs>	 (PS1) Andrew Bogott: Neutron, glance, cinder, keystone: Move api firewall rules into haproxy code [puppet] - https://gerrit.wikimedia.org/r/849098 (https://phabricator.wikimedia.org/T319312)
[13:54:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36294 and previous config saved to /var/cache/conftool/dbconfig/20221025-135451-ladsgroup.json
[13:54:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[13:55:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[13:55:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36295 and previous config saved to /var/cache/conftool/dbconfig/20221025-135515-ladsgroup.json
[13:55:30] <Amir1>	 marostegui: are you creating the ticket or should I?
[13:55:54] <marostegui>	 Amir1: Please do it, I am still trying if I can get it back
[13:56:01] <Amir1>	 sure
[13:56:13] <wikibugs>	 (CR) Herron: [C: +1] "Seems most of the unresolved comments have been addressed by now, maybe one or two minor things remaining, LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff)
[13:56:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:57:34] <wikibugs>	 ops-eqiad: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (Ladsgroup)
[13:57:48] <wikibugs>	 (CR) Giuseppe Lavagetto: Add cookbook to restart pybal (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[13:58:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36296 and previous config saved to /var/cache/conftool/dbconfig/20221025-135852-ladsgroup.json
[13:59:07] <XioNoX>	 !log test bouncing VC port on asw2-d-eqiad
[13:59:08] <wikibugs>	 (PS1) Majavah: openstack: wmf_sink: set accept header for enc deletion calls [puppet] - https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503)
[13:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:13] <wikibugs>	 (PS7) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949
[13:59:34] <wikibugs>	 ops-eqiad, DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (Marostegui) p:Triage→High I have tried to power it off and then back on from the console but I was getting weird outputs like:  ` racadm>>serveraction powerstatus Server power status: OFF racadm>>...
[13:59:47] <wikibugs>	 (CR) CI reject: [V: -1] openstack: wmf_sink: set accept header for enc deletion calls [puppet] - https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) (owner: Majavah)
[13:59:58] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:00:15] <wikibugs>	 (PS2) Majavah: openstack: wmf_sink: set accept header for enc deletion calls [puppet] - https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503)
[14:00:42] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 8.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36297 and previous config saved to /var/cache/conftool/dbconfig/20221025-140131-ladsgroup.json
[14:02:01] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Neutron, glance, cinder, keystone: Move api firewall rules into haproxy code [puppet] - https://gerrit.wikimedia.org/r/849098 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[14:03:16] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack: wmf_sink: set accept header for enc deletion calls [puppet] - https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) (owner: Majavah)
[14:04:03] <wikibugs>	 (Abandoned) Btullis: Add cumin aliases for dse-k8s in eqiad [puppet] - https://gerrit.wikimedia.org/r/843932 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[14:04:48] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Add a simple mechanism for creating postgresql users and databases [puppet] - https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: Btullis)
[14:05:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) As data point I tried: `asw2-d-eqiad# run request virtual-chassis vc-port set pic-slot 0 member 2 port 49` th...
[14:09:55] <wikibugs>	 (CR) Volans: [C: -1] "The logic looks good to me, just a couple of errors inline." [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[14:10:18] <wikibugs>	 (PS4) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[14:12:25] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/849050 (owner: Arturo Borrero Gonzalez)
[14:13:43] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[14:13:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36298 and previous config saved to /var/cache/conftool/dbconfig/20221025-141358-ladsgroup.json
[14:14:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance
[14:14:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance
[14:14:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[14:14:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[14:14:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36299 and previous config saved to /var/cache/conftool/dbconfig/20221025-141440-ladsgroup.json
[14:16:21] <wikibugs>	 (PS1) Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312)
[14:16:25] <wikibugs>	 (PS1) Ssingh: Depool ulsfo for cp hosts hardware refresh [dns] - https://gerrit.wikimedia.org/r/849105 (https://phabricator.wikimedia.org/T317247)
[14:16:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P36300 and previous config saved to /var/cache/conftool/dbconfig/20221025-141638-ladsgroup.json
[14:16:55] <wikibugs>	 (CR) CI reject: [V: -1] haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[14:17:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2321.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:17:31] <wikibugs>	 (PS5) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[14:18:25] <hashar>	 !log Restarting CI Jenkins
[14:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:04] <wikibugs>	 (PS10) Filippo Giunchedi: dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[14:20:06] <wikibugs>	 (PS4) Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229)
[14:20:18] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:20:27] <wikibugs>	 (PS2) Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312)
[14:21:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36301 and previous config saved to /var/cache/conftool/dbconfig/20221025-142106-ladsgroup.json
[14:21:16] <wikibugs>	 (CR) CI reject: [V: -1] haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[14:21:48] <wikibugs>	 (CR) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[14:21:51] <wikibugs>	 (PS3) Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312)
[14:23:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2668.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:23:29] <wikibugs>	 (CR) Andrew Bogott: [C: +2] haproxy: correct srange syntax for internal apis [puppet] - https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[14:24:32] <wikibugs>	 (PS2) Clément Goubert: aptrepo: add component thirdparty/otelcol-contrib [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551)
[14:24:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb1016.eqiad.wmnet with reason: db1154 having hw issues
[14:25:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb1016.eqiad.wmnet with reason: db1154 having hw issues
[14:30:03] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:03] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms
[14:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P36303 and previous config saved to /var/cache/conftool/dbconfig/20221025-143144-ladsgroup.json
[14:34:01] <wikibugs>	 (CR) FNegri: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37737/console" [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[14:35:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Thanks @ayounsi, was worth a shot :)  I'm thinking we probably proceed as follows:  1. Perform master switch...
[14:35:29] <wikibugs>	 (CR) FNegri: [V: +1] Add thirdparty/tekton repo to WMCS bastions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[14:35:34] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] realm.pp: introduce $::wmcs_project [puppet] - https://gerrit.wikimedia.org/r/849050 (owner: Arturo Borrero Gonzalez)
[14:35:48] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: realm.pp: introduce $::wmcs_project [puppet] - https://gerrit.wikimedia.org/r/849050
[14:36:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36304 and previous config saved to /var/cache/conftool/dbconfig/20221025-143613-ladsgroup.json
[14:37:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Just a note that I should have added previously that Juniper wouldn't provide support due to JunOS 14.1 being...
[14:37:43] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:38:09] <wikibugs>	 (PS6) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[14:41:34] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[14:42:49] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (phaultfinder)
[14:42:59] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[14:44:14] <wikibugs>	 (CR) Ottomata: "Awesome!  two more nits, but +1 otherwise!" [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[14:46:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36305 and previous config saved to /var/cache/conftool/dbconfig/20221025-144651-ladsgroup.json
[14:46:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] xmldumps: Enable profile::auto_restarts::service for nginx [puppet] - https://gerrit.wikimedia.org/r/832259 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:49:45] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:49:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:51:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36306 and previous config saved to /var/cache/conftool/dbconfig/20221025-145120-ladsgroup.json
[14:51:47] <wikibugs>	 (CR) Volans: "followup from IRC chat" [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[14:53:01] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.938 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:15] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:39] <wikibugs>	 (CR) Ssingh: [C: +2] Depool ulsfo for cp hosts hardware refresh [dns] - https://gerrit.wikimedia.org/r/849105 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[14:54:14] <sukhe>	 !log running authdns-update for depooling ulsfo: Gerrit 849105
[14:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:56] <wikibugs>	 (PS1) Giuseppe Lavagetto: httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - https://gerrit.wikimedia.org/r/849108
[15:05:42] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Test adding wikifunctions.org in acmechief-test1001 [puppet] - https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227)
[15:06:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36307 and previous config saved to /var/cache/conftool/dbconfig/20221025-150626-ladsgroup.json
[15:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:06:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:06:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36308 and previous config saved to /var/cache/conftool/dbconfig/20221025-150653-ladsgroup.json
[15:06:54] <wikibugs>	 (CR) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[15:07:07] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37741/console" [puppet] - https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227) (owner: Vgutierrez)
[15:10:05] <wikibugs>	 (CR) Clément Goubert: [C: +2] aptrepo: add component thirdparty/otelcol-contrib [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) (owner: Clément Goubert)
[15:11:15] <moritzm>	 !log installing isc-dhcp security updates
[15:11:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36309 and previous config saved to /var/cache/conftool/dbconfig/20221025-151308-ladsgroup.json
[15:13:19] <wikibugs>	 (CR) Vgutierrez: [C: +1] Clean up outdated commentary on requestctl [puppet] - https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) (owner: BBlack)
[15:17:19] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: add sanitize filter [puppet] - https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: Cwhite)
[15:21:38] <wikibugs>	 (CR) Elukey: [C: +1] "Looks good to me, it should work, even if the usage of <If> should be limited as much as possible for perf reasons IIRC. In this case is p" [docker-images/production-images] - https://gerrit.wikimedia.org/r/849108 (owner: Giuseppe Lavagetto)
[15:22:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[15:25:33] <claime>	 !log added component thirdparty/otelcol-contrib to apt repository
[15:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36310 and previous config saved to /var/cache/conftool/dbconfig/20221025-152815-ladsgroup.json
[15:28:37] <wikibugs>	 SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) >>! In T321547#8341438, @Vgutierrez wrote: > nice catch @fgiunchedi. Actually I've assumed that it wasn't fired cause we didn't get the recovery on the traffic IRC channel when T321545 got fixed...
[15:29:41] <wikibugs>	 (PS2) Giuseppe Lavagetto: httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - https://gerrit.wikimedia.org/r/849108
[15:30:17] <claime>	 !log added package otelcol-contrib_0.62.1_linux_amd64.deb to component thirdparty/otelcol-contrib for bullseye-wikimedia and buster-wikimedia - T320551
[15:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:23] <stashbot>	 T320551: Package OpenTelemetry Collector as a .deb - https://phabricator.wikimedia.org/T320551
[15:33:26] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (MatthewVernon) >>! In T308677#8339287, @jbond wrote: >>>! In T308677#8338658,...
[15:36:04] <wikibugs>	 (PS1) Ayounsi: Add Peering News to Puppet [puppet] - https://gerrit.wikimedia.org/r/849114
[15:37:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8399
[15:38:18] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8399
[15:39:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[15:40:05] <wikibugs>	 (CR) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[15:43:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36311 and previous config saved to /var/cache/conftool/dbconfig/20221025-154321-ladsgroup.json
[15:43:35] <wikibugs>	 (PS7) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[15:44:43] <wikibugs>	 (PS1) Hnowlan: kask: make TLS configuration a secret [deployment-charts] - https://gerrit.wikimedia.org/r/849117
[15:47:22] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[15:48:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) I don't remember the impact of a switchover (eg. if it's none or tiny). So to be done carefully. At least the...
[15:49:45] <wikibugs>	 (CR) FNegri: [V: +1 C: +2] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri)
[15:50:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) (owner: Arturo Borrero Gonzalez)
[15:54:44] <wikibugs>	 (CR) Ottomata: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[15:56:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[15:57:09] <wikibugs>	 (PS8) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[15:57:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - https://gerrit.wikimedia.org/r/849108 (owner: Giuseppe Lavagetto)
[15:58:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36312 and previous config saved to /var/cache/conftool/dbconfig/20221025-155828-ladsgroup.json
[15:58:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:58:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:58:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321312)', diff saved to https://phabricator.wikimedia.org/P36313 and previous config saved to /var/cache/conftool/dbconfig/20221025-155855-ladsgroup.json
[15:59:28] <wikibugs>	 (PS1) Vgutierrez: hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - https://gerrit.wikimedia.org/r/849121
[16:00:05] <jouncebot>	 jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:09] <wikibugs>	 (PS2) Vgutierrez: hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - https://gerrit.wikimedia.org/r/849121
[16:00:34] <vgutierrez>	 jbond: ^^ I don't know if I'm missing something there
[16:00:42] <vgutierrez>	 but it feels pretty weird to me
[16:00:45] <wikibugs>	 (PS1) Btullis: Open up the postrges service to the analytics vlans [puppet] - https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440)
[16:01:06] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[16:04:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[16:04:32] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[16:04:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[16:05:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321312)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221025-160504-ladsgroup.json
[16:06:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs2002.codfw.wmnet
[16:06:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs2002.codfw.wmnet
[16:06:37] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37744/console" [puppet] - https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440) (owner: Btullis)
[16:08:40] <wikibugs>	 (PS2) Btullis: Open up the postrges service to the analytics vlans [puppet] - https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440)
[16:09:15] <wikibugs>	 (CR) Vgutierrez: "current logic looks good to me as well" [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[16:12:06] <wikibugs>	 (PS2) Cwhite: hiera: map logstash.wm.o to kibana7.codfw [puppet] - https://gerrit.wikimedia.org/r/828109 (https://phabricator.wikimedia.org/T304440)
[16:14:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:14:39] <wikibugs>	 (CR) Vgutierrez: [C: +2] "```" [puppet] - https://gerrit.wikimedia.org/r/849121 (owner: Vgutierrez)
[16:14:41] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:14:55] <wikibugs>	 (CR) BBlack: [C: +2] Add wikifunctions.org to exim domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: BBlack)
[16:16:54] <wikibugs>	 (PS2) Vgutierrez: acme_chief: Test adding wikifunctions.org in acmechief-test1001 [puppet] - https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227)
[16:18:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:18:37] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.810 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:20:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36315 and previous config saved to /var/cache/conftool/dbconfig/20221025-162015-ladsgroup.json
[16:33:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[16:35:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36316 and previous config saved to /var/cache/conftool/dbconfig/20221025-163522-ladsgroup.json
[16:36:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[16:58:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36320 and previous config saved to /var/cache/conftool/dbconfig/20221025-165831-ladsgroup.json
[16:59:39] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[17:02:07] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) sub-ports are ready for cr2-eqiad  ` papaul@re0.cr2-eqiad# run show interfaces terse | match xe-1/0/* xe-1/0/1:0              down  down xe-1/0/1:1...
[17:02:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[17:02:15] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[17:02:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[17:02:45] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:47] <wikibugs>	 (PS1) Andrew Bogott: OpenStack trove: expose API to the public internet [puppet] - https://gerrit.wikimedia.org/r/849127 (https://phabricator.wikimedia.org/T319312)
[17:03:09] <wikibugs>	 (PS2) Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326)
[17:05:01] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:05:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:06:35] <wikibugs>	 (CR) CI reject: [V: -1] api-gateway: create fine-grained liftwing API definitions [deployment-charts] - https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[17:08:33] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[17:09:47] <icinga-wm>	 PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:09:59] <icinga-wm>	 PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack trove: expose API to the public internet [puppet] - https://gerrit.wikimedia.org/r/849127 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[17:10:41] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[17:12:01] <wikibugs>	 (PS1) Andrew Bogott: haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312)
[17:12:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:12:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[17:13:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36321 and previous config saved to /var/cache/conftool/dbconfig/20221025-171337-ladsgroup.json
[17:13:44] <wikibugs>	 SRE, Traffic, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (BCornwall) Open→Resolved a:BCornwall Thanks!
[17:13:56] <wikibugs>	 (PS2) Andrew Bogott: haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312)
[17:14:33] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:14:36] <wikibugs>	 (CR) Andrew Bogott: [C: +2] haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[17:14:39] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:17:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:17:47] <wikibugs>	 (PS1) FNegri: Don't use slash in apt:repo name [puppet] - https://gerrit.wikimedia.org/r/849129
[17:17:57] <wikibugs>	 (PS10) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[17:17:59] <wikibugs>	 (PS1) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130
[17:18:25] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/849129 (owner: FNegri)
[17:20:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[17:20:14] <wikibugs>	 (CR) FNegri: [C: +2] Don't use slash in apt:repo name [puppet] - https://gerrit.wikimedia.org/r/849129 (owner: FNegri)
[17:21:35] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[17:21:50] <wikibugs>	 (CR) CI reject: [V: -1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond)
[17:23:21] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:23:57] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@d3b7785] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d3b7785]
[17:24:18] <wikibugs>	 (PS1) Herron: slo_dashboard: move to one SLO/SLI per dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749)
[17:25:00] <wikibugs>	 (PS2) Herron: slo_dashboards: move to one SLO/SLI per dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749)
[17:25:01] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@d3b7785] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d3b7785] (duration: 01m 04s)
[17:26:32] <wikibugs>	 (CR) Herron: slo_dashboards: move slo definitions and defaults to files (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[17:27:42] <wikibugs>	 (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (3 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[17:28:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321312)', diff saved to https://phabricator.wikimedia.org/P36322 and previous config saved to /var/cache/conftool/dbconfig/20221025-172844-ladsgroup.json
[17:28:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:29:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:29:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36323 and previous config saved to /var/cache/conftool/dbconfig/20221025-172909-ladsgroup.json
[17:30:46] <wikibugs>	 (CR) Herron: "note: this patch should be a noop in terms of grafana dashboard output" [grafana-grizzly] - https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[17:31:47] <wikibugs>	 (CR) David Caro: [C: +2] p::toolforge:harbor::prepare: upgrade harbor to v2.5.4 [puppet] - https://gerrit.wikimedia.org/r/848602 (https://phabricator.wikimedia.org/T316530) (owner: Raymond Ndibe)
[17:38:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36324 and previous config saved to /var/cache/conftool/dbconfig/20221025-173817-ladsgroup.json
[17:39:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[17:40:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[17:40:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321312)', diff saved to https://phabricator.wikimedia.org/P36325 and previous config saved to /var/cache/conftool/dbconfig/20221025-174013-ladsgroup.json
[17:42:00] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:46:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321312)', diff saved to https://phabricator.wikimedia.org/P36326 and previous config saved to /var/cache/conftool/dbconfig/20221025-174639-ladsgroup.json
[17:53:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36327 and previous config saved to /var/cache/conftool/dbconfig/20221025-175323-ladsgroup.json
[17:54:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:55:30] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:57:40] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:57:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4025.ulsfo.wmnet
[17:57:44] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4025.ulsfo.wmnet` - cp4025.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[17:58:17] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:58:18] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4023.ulsfo.wmnet
[17:58:22] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4023.ulsfo.wmnet` - cp4023.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[17:58:22] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:59:38] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[17:59:44] <icinga-wm>	 PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:00:23] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:00:34] <wikibugs>	 (PS8) Muehlenhoff: Add a cookbook to restart/reboot logstash collector nodes [cookbooks] - https://gerrit.wikimedia.org/r/832447
[18:01:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36328 and previous config saved to /var/cache/conftool/dbconfig/20221025-180145-ladsgroup.json
[18:02:11] <wikibugs>	 (CR) Muehlenhoff: "Ack, thanks for the various reviews. I'm going to merge and then we can test this (and fine-tune if needed) once the OpenSearch update is " [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff)
[18:03:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:04:25] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4039
[18:04:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4039
[18:04:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4041
[18:04:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4041
[18:05:10] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4043
[18:05:25] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4043
[18:06:42] <wikibugs>	 (CR) Eevans: [C: +1] kask: make TLS configuration a secret [deployment-charts] - https://gerrit.wikimedia.org/r/849117 (owner: Hnowlan)
[18:07:14] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4039.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:07:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:08:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36329 and previous config saved to /var/cache/conftool/dbconfig/20221025-180830-ladsgroup.json
[18:09:08] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4043.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:11:26] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:13:27] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[18:13:40] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[18:13:44] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[18:16:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36330 and previous config saved to /var/cache/conftool/dbconfig/20221025-181652-ladsgroup.json
[18:18:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add a cookbook to restart/reboot logstash collector nodes [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff)
[18:19:15] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[18:19:28] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) failure of provision script against cp4039  ` [1/30, retrying in 30.00s] Polling task: JID_667217070909 not completed yet: status=OK, state=Running, complete...
[18:19:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:20:04] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4039.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:21:54] <wikibugs>	 (PS2) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130
[18:22:00] <wikibugs>	 (PS11) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[18:22:59] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4043.mgmt.ulsfo.wmnet with reboot policy FORCED
[18:23:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36331 and previous config saved to /var/cache/conftool/dbconfig/20221025-182336-ladsgroup.json
[18:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[18:23:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[18:24:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36332 and previous config saved to /var/cache/conftool/dbconfig/20221025-182402-ladsgroup.json
[18:24:16] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:26:28] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[18:29:39] <wikibugs>	 (PS12) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[18:29:46] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4022.ulsfo.wmnet
[18:29:56] <wikibugs>	 (PS13) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[18:29:58] <wikibugs>	 (PS1) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135
[18:30:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36333 and previous config saved to /var/cache/conftool/dbconfig/20221025-183008-ladsgroup.json
[18:30:44] <wikibugs>	 (CR) Jbond: "ready for review, examples in the follow up patches" [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond)
[18:31:21] <wikibugs>	 (CR) Nskaggs: [C: +1] "Looks good. Modifying the array to add more projects should be simpler now." [puppet] - https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[18:31:50] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4024.ulsfo.wmnet
[18:31:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321312)', diff saved to https://phabricator.wikimedia.org/P36334 and previous config saved to /var/cache/conftool/dbconfig/20221025-183158-ladsgroup.json
[18:32:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[18:32:16] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn)
[18:32:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[18:32:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T321312)', diff saved to https://phabricator.wikimedia.org/P36335 and previous config saved to /var/cache/conftool/dbconfig/20221025-183224-ladsgroup.json
[18:33:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond)
[18:33:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond)
[18:34:16] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:34:16] <wikibugs>	 (03PS3) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130
[18:34:27] <wikibugs>	 (03PS14) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093
[18:34:33] <wikibugs>	 (03PS2) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[18:34:52] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:35:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:35:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:36:06] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH)
[18:36:12] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4034 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:36:16] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4026 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:37:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp4024.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4024.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:37:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (12) confd resource _srv_config-master_pybal_codfw_upload-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:37:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4007 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp4024.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4024.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:37:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:37:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:37:41] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4022.ulsfo.wmnet
[18:37:44] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4022.ulsfo.wmnet` - cp4022.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:38:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond)
[18:38:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond)
[18:38:44] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:38:44] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4024.ulsfo.wmnet
[18:38:47] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4024.ulsfo.wmnet` - cp4024.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:40:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321312)', diff saved to https://phabricator.wikimedia.org/P36336 and previous config saved to /var/cache/conftool/dbconfig/20221025-184006-ladsgroup.json
[18:41:10] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4026.ulsfo.wmnet
[18:41:59] <wikibugs>	 (03PS4) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130
[18:42:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (24) confd resource _srv_config-master_pybal_codfw_upload-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:44:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4028.ulsfo.wmnet
[18:45:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P36337 and previous config saved to /var/cache/conftool/dbconfig/20221025-184514-ladsgroup.json
[18:45:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4030.ulsfo.wmnet
[18:45:49] <wikibugs>	 (03PS15) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093
[18:45:51] <wikibugs>	 (03PS3) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[18:46:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:46:30] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4047 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:47:32] <icinga-wm>	 PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:47:59] <wikibugs>	 (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[18:49:05] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:49:06] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4026.ulsfo.wmnet
[18:49:09] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4026.ulsfo.wmnet` - cp4026.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:49:42] <icinga-wm>	 PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:49:52] <icinga-wm>	 PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:49:52] <icinga-wm>	 PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:49:52] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4032.ulsfo.wmnet
[18:50:20] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse)
[18:50:23] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:50:26] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse)
[18:50:26] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:50:32] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse)
[18:51:27] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF)
[18:51:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:52:28] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:52:40] <icinga-wm>	 PROBLEM - Host elastic2052 is DOWN: PING CRITICAL - Packet loss = 100%
[18:52:40] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:52:57] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:52:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4030.ulsfo.wmnet
[18:53:02] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4030.ulsfo.wmnet` - cp4030.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:53:34] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:53:36] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:53:37] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4028.ulsfo.wmnet
[18:53:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:53:44] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4028.ulsfo.wmnet` - cp4028.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:53:48] <icinga-wm>	 RECOVERY - Host elastic2052 is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms
[18:54:30] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:55:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36338 and previous config saved to /var/cache/conftool/dbconfig/20221025-185513-ladsgroup.json
[18:55:56] <icinga-wm>	 PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:55:56] <icinga-wm>	 PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:56:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: textlb_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: testlb6_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: textlb6_443: Servers cp4032.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:56:27] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[18:56:36] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4036 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:56:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:56:58] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4035 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:57:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:57:41] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4032.ulsfo.wmnet
[18:57:44] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4032.ulsfo.wmnet` - cp4032.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[18:59:17] <inflatador>	 !log bking@elastic2070 'restarting elastic7 services to apply 838141'
[18:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P36339 and previous config saved to /var/cache/conftool/dbconfig/20221025-190021-ladsgroup.json
[19:01:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:02:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:08:09] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36340 and previous config saved to /var/cache/conftool/dbconfig/20221025-191020-ladsgroup.json
[19:10:35] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:12:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:12:41] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/828109 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite)
[19:13:22] <wikibugs>	 (03PS2) 10Ssingh: cp4023: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244)
[19:14:49] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp4023: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh)
[19:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36341 and previous config saved to /var/cache/conftool/dbconfig/20221025-191527-ladsgroup.json
[19:15:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[19:15:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[19:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36342 and previous config saved to /var/cache/conftool/dbconfig/20221025-191552-ladsgroup.json
[19:21:03] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.684 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:22:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36343 and previous config saved to /var/cache/conftool/dbconfig/20221025-192203-ladsgroup.json
[19:23:53] <wikibugs>	 (03PS1) 10Ssingh: cp4025: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849140 (https://phabricator.wikimedia.org/T317244)
[19:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321312)', diff saved to https://phabricator.wikimedia.org/P36344 and previous config saved to /var/cache/conftool/dbconfig/20221025-192526-ladsgroup.json
[19:25:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[19:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[19:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[19:25:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[19:25:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36345 and previous config saved to /var/cache/conftool/dbconfig/20221025-192556-ladsgroup.json
[19:26:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:26:43] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp4025: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849140 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh)
[19:27:55] <icinga-wm>	 PROBLEM - Host wcqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.524 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:28:17] <icinga-wm>	 RECOVERY - Host wcqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[19:28:37] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 2.514 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:29:25] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10RKemper) >>! In T320482#8309495, @Papaul wrote: > @bking this  host is out of warranty. If it is a critical host you will have to let us know and request to purchase a disk.  Ano...
[19:33:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36347 and previous config saved to /var/cache/conftool/dbconfig/20221025-193331-ladsgroup.json
[19:34:04] <wikibugs>	 (03PS1) 10Ssingh: cp402[2468]: decommission hosts as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244)
[19:36:49] <wikibugs>	 (03PS1) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[19:37:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36348 and previous config saved to /var/cache/conftool/dbconfig/20221025-193709-ladsgroup.json
[19:39:58] <cwhite>	 !log logstash opensearch 2.2.0 codfw transition complete T304440
[19:40:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:04] <stashbot>	 T304440: Test and upgrade OpenSearch to 2.2.0 - https://phabricator.wikimedia.org/T304440
[19:40:33] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Pcoombe)
[19:40:49] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Pcoombe)
[19:40:51] <wikibugs>	 (03PS9) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088)
[19:43:45] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:44:29] <wikibugs>	 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH)
[19:47:55] <wikibugs>	 (03PS1) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129)
[19:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36349 and previous config saved to /var/cache/conftool/dbconfig/20221025-194838-ladsgroup.json
[19:48:48] <wikibugs>	 (03PS2) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129)
[19:49:45] <wikibugs>	 (03PS3) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129)
[19:50:35] <wikibugs>	 (03PS4) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129)
[19:50:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[19:52:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36350 and previous config saved to /var/cache/conftool/dbconfig/20221025-195216-ladsgroup.json
[19:53:40] <wikibugs>	 (03PS2) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[19:53:52] <wikibugs>	 (03CR) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo)
[19:53:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:54:05] <wikibugs>	 (03PS3) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[19:54:07] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:54:48] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[19:54:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[19:55:02] <wikibugs>	 (03PS1) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605)
[19:56:45] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:56:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4038
[19:57:14] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4038
[19:57:15] <wikibugs>	 (03PS2) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605)
[19:58:17] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "wmcs: format and refactor maintain-dbusers.py" [puppet] - 10https://gerrit.wikimedia.org/r/849150
[19:58:27] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:58:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T2000).
[20:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <Jdlrobson>	 present
[20:00:30] <cjming>	 hi ! i can deploy
[20:00:49] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[20:01:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking)
[20:01:57] <cjming>	 Jdlrobson: is that one test failing a problem?
[20:02:37] <Jdlrobson>	 looking..
[20:03:03] <Jdlrobson>	 some linting issues.. fixing..
[20:03:10] <cjming>	 mostly linting errors
[20:03:24] <wikibugs>	 (03PS3) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605)
[20:03:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36351 and previous config saved to /var/cache/conftool/dbconfig/20221025-200344-ladsgroup.json
[20:03:47] <wikibugs>	 (03PS4) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605)
[20:05:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs: format and refactor maintain-dbusers.py" [puppet] - 10https://gerrit.wikimedia.org/r/849150 (owner: 10Andrew Bogott)
[20:05:39] <wikibugs>	 (03PS4) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[20:06:26] <Jdlrobson>	 cmjohnson1: ok that should do it
[20:07:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[20:07:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36352 and previous config saved to /var/cache/conftool/dbconfig/20221025-200723-ladsgroup.json
[20:07:23] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[20:07:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[20:07:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[20:07:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36353 and previous config saved to /var/cache/conftool/dbconfig/20221025-200746-ladsgroup.json
[20:08:12] <cjming>	 Jdlrobson: almost - wg prefix?
[20:08:43] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking)
[20:08:48] * urbanecm waves and sees B&C is being taken care of, thanks cjming
[20:08:54] <cjming>	 np!
[20:09:26] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:09:44] <Jdlrobson>	 :D
[20:10:06] <wikibugs>	 (03CR) 10Bking: [C: 03+2] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking)
[20:10:13] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4040
[20:10:25] <wikibugs>	 (03PS5) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[20:10:27] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4040
[20:10:31] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4042
[20:10:46] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4042
[20:10:50] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4044
[20:11:04] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4044
[20:11:06] <wikibugs>	 (PS1) Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - https://gerrit.wikimedia.org/r/849166 (https://phabricator.wikimedia.org/T304040)
[20:11:07] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4046
[20:11:22] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4046
[20:11:26] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4048
[20:11:41] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4048
[20:13:34] <wikibugs>	 (Merged) jenkins-bot: query_service: sanity-check file size on data-transfer.py [cookbooks] - https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: Bking)
[20:13:35] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[20:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36354 and previous config saved to /var/cache/conftool/dbconfig/20221025-201343-ladsgroup.json
[20:13:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4034.ulsfo.wmnet
[20:14:22] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:14:27] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cp4034.ulsfo.wmnet
[20:14:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:14:48] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4034.ulsfo.wmnet
[20:14:49] <wikibugs>	 (PS6) Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[20:15:13] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4036.ulsfo.wmnet
[20:15:55] <wikibugs>	 (PS7) Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223)
[20:15:58] <Jdlrobson>	 sorry cjming ^
[20:16:04] <Jdlrobson>	 had to make some updates to one of the assets
[20:16:14] <cjming>	 oh - whoops
[20:17:31] <cjming>	 hmm -- quick Q urbanecm: if i've already run scap backport but the patch got updated during it, will it just fail and i can rerun?
[20:17:43] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (XenoRyet)
[20:18:01] <urbanecm>	 cjming: if the patch did not get merged yet, it will hang forever
[20:18:08] <urbanecm>	 it should work if you +2 it manually
[20:18:11] <urbanecm>	 (the patch, i mean)
[20:18:17] <cjming>	 cool - thanks
[20:18:42] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: add rsyslog::input::files to send apache logs to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[20:18:51] <cjming>	 Jdlrobson: gtg?
[20:18:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36355 and previous config saved to /var/cache/conftool/dbconfig/20221025-201852-ladsgroup.json
[20:18:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[20:19:03] <Jdlrobson>	 gtg!
[20:19:10] <wikibugs>	 (CR) Clare Ming: [C: +2] Update remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:19:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[20:19:13] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: mnt-nfs-dumps\x2dlabstore1006.wikimedia.org.mount,mnt-nfs-dumps\x2dlabstore1007.wikimedia.org.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36356 and previous config saved to /var/cache/conftool/dbconfig/20221025-201918-ladsgroup.json
[20:20:07] <wikibugs>	 (Merged) jenkins-bot: Update remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:20:32] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]]
[20:20:38] <stashbot>	 T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223
[20:20:56] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:21:09] <cjming>	 Jdlrobson: wanna check a debug server?
[20:21:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[20:21:38] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[20:21:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:21:50] <Jdlrobson>	 cjming: yes please
[20:21:53] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4037 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[20:22:11] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:22:21] <Jdlrobson>	 LGTM cjming !
[20:22:35] <cjming>	 great - going live
[20:23:28] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:23:30] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4036.ulsfo.wmnet
[20:23:38] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4036.ulsfo.wmnet` - cp4036.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[20:24:04] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs: format and refactor maintain-dbusers.py [puppet] - https://gerrit.wikimedia.org/r/849166 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[20:24:05] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:24:05] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4034.ulsfo.wmnet
[20:24:08] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4034.ulsfo.wmnet` - cp4034.ulsfo.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[20:24:57] <Jdlrobson>	 Thanks a lot @cjming !
[20:24:59] <Jdlrobson>	 looking great
[20:25:22] <wikibugs>	 (PS1) Dzahn: rsyslog: forward miscweb logs to kafka [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090)
[20:25:39] <cjming>	 np!
[20:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36357 and previous config saved to /var/cache/conftool/dbconfig/20221025-202538-ladsgroup.json
[20:26:36] <urbanecm>	 cjming: did it work?
[20:27:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (48) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:27:15] <cjming>	 urbanecm: yes! thanks - another Q tho - i think i need to purge a few of the files - can i run purgeList on a directory?
[20:27:20] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]] (duration: 06m 48s)
[20:27:26] <stashbot>	 T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223
[20:28:08] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[20:28:14] <urbanecm>	 cjming: purgeList.php needs URIs to purge, unfortunately.
[20:28:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:28:32] <cjming>	 bummer - ok, i'll run each one
[20:28:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36358 and previous config saved to /var/cache/conftool/dbconfig/20221025-202849-ladsgroup.json
[20:28:57] <icinga-wm>	 PROBLEM - Host cp4034.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:28:57] <icinga-wm>	 PROBLEM - Host cp4036.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:29:04] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[20:29:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:29:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:29:26] <urbanecm>	 cjming: you can do something like `ls /srv/mediawiki-staging/static/images/mobile/copyright/ | sed 's#^#https://en.wikipedia.org/static/images/mobile/copyright/#g' | mwscript purgeList.php` though
[20:30:00] <cjming>	 fancy - ok i'll try that
[20:30:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:31:55] <cjming>	 thanks urbanecm: works perfectly
[20:32:00] <urbanecm>	 great!
[20:34:24] <icinga-wm>	 PROBLEM - Host wcqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:35:16] <icinga-wm>	 RECOVERY - Host wcqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[20:36:09] <koi>	 hi cjming and Jdlrobson, there's a problem with zhwiki's tagline: we don't want it to look like this, as our local consensus is not to change the tagline for the 20-year celebration
[20:36:56] <koi>	 and currently the tagline looks pretty small https://zh.wikipedia.org/wiki/?useskin=vector-2022
[20:37:26] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[20:39:21] <cjming>	 Jdlrobson: do you have the previous file? we can revert that one per koi's note above
[20:39:27] <wikibugs>	 (CR) Dzahn: "compiling on C:profile::rsyslog::kafka_shipper which is a lot of hosts" [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[20:39:53] <koi>	 cjming: I'm writing a patch now, please wait a sec
[20:40:07] <cjming>	 koi: great - standing by
[20:40:27] <cjming>	 Jdlrobson: nvm - i'll wait for koi's patch
[20:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36359 and previous config saved to /var/cache/conftool/dbconfig/20221025-204045-ladsgroup.json
[20:41:36] <wikibugs>	 (PS1) Stang: Revert tagline of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/849171
[20:41:58] <koi>	 cjming: uploaded ^
[20:42:09] <cjming>	 yup - on it
[20:43:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849171 (owner: Stang)
[20:43:13] <koi>	 cjming: hang on, I forgot one part
[20:43:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36360 and previous config saved to /var/cache/conftool/dbconfig/20221025-204356-ladsgroup.json
[20:44:09] <wikibugs>	 (Merged) jenkins-bot: Revert tagline of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/849171 (owner: Stang)
[20:44:14] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:44:31] <cjming>	 koi: whoops - i seem to be trigger-happy today -- can it be a follow up?
[20:44:35] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:849171|Revert tagline of zhwiki]]
[20:44:46] <koi>	 ok, trying
[20:44:58] <logmsgbot>	 !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:849171|Revert tagline of zhwiki]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:45:31] <cjming>	 koi: it's up on the debug servers ^^ if you want to verify
[20:46:48] <wikibugs>	 (PS1) Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040)
[20:47:04] <icinga-wm>	 PROBLEM - Host wcqs2003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:47:44] <wikibugs>	 (PS1) Stang: Revert tagline of zhwiki (cont.) [mediawiki-config] - https://gerrit.wikimedia.org/r/849174
[20:48:05] <koi>	 cjming: posted another one, a continuation of the first patch
[20:48:19] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[20:48:54] <icinga-wm>	 RECOVERY - Host wcqs2003 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms
[20:48:54] <cjming>	 ok - since your first revert already merged, i'll sync that one and do your 2nd patch - then purge it
[20:48:59] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4050
[20:49:14] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4050
[20:49:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4052
[20:49:22] <koi>	 ok!
[20:49:33] <wikibugs>	 (CR) Stef Dunlap: "Would you mind reviewing my patch or suggesting someone who might be able to review it?" [deployment-charts] - https://gerrit.wikimedia.org/r/845680 (owner: Stef Dunlap)
[20:49:36] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4052
[20:50:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:50:40] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37748/" [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[20:51:54] <wikibugs>	 (PS1) Jdlrobson: WIP: Fix remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223)
[20:53:11] <wikibugs>	 (PS2) Andrew Bogott: wmcs: format and refactor maintain-dbusers.py [puppet] - https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[20:53:46] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849171|Revert tagline of zhwiki]] (duration: 09m 11s)
[20:54:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849174 (owner: Stang)
[20:54:44] <wikibugs>	 (Merged) jenkins-bot: Revert tagline of zhwiki (cont.) [mediawiki-config] - https://gerrit.wikimedia.org/r/849174 (owner: Stang)
[20:55:09] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]]
[20:55:32] <logmsgbot>	 !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:55:35] <cjming>	 koi: can you check on a debug server?
[20:55:39] <koi>	 looking
[20:55:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs: format and refactor maintain-dbusers.py [puppet] - https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[20:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36361 and previous config saved to /var/cache/conftool/dbconfig/20221025-205551-ladsgroup.json
[20:55:54] <koi>	 cjming: LGTM
[20:56:00] <cjming>	 great - syncing
[20:56:54] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:58:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:58:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:59:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:59:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36362 and previous config saved to /var/cache/conftool/dbconfig/20221025-205902-ladsgroup.json
[20:59:59] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]] (duration: 04m 49s)
[21:00:22] <cjming>	 koi: should be live - purged your files
[21:01:15] <koi>	 thanks!
[21:01:19] <cjming>	 np!
[21:01:21] <cjming>	 !log end of UTC late backport window
[21:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:16] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (RobH)
[21:03:10] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:03:51] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4033.ulsfo.wmnet
[21:03:55] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:03:56] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4035.ulsfo.wmnet
[21:04:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:05:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:05:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:05:35] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:06:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:08:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:08:23] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:09:45] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[21:10:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36363 and previous config saved to /var/cache/conftool/dbconfig/20221025-211058-ladsgroup.json
[21:11:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[21:11:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[21:11:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36364 and previous config saved to /var/cache/conftool/dbconfig/20221025-211125-ladsgroup.json
[21:11:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:12:01] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:12:02] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4035.ulsfo.wmnet
[21:12:05] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4035.ulsfo.wmnet` - cp4035.ulsfo.wmnet (**FAIL**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[21:12:38] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:12:40] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4033.ulsfo.wmnet
[21:12:44] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4033.ulsfo.wmnet` - cp4033.ulsfo.wmnet (**FAIL**)   - Downtimed host on Icinga/Alertmanager   - Fo...
[21:15:32] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:17:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36365 and previous config saved to /var/cache/conftool/dbconfig/20221025-211730-ladsgroup.json
[21:20:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[21:20:20] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[21:20:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[21:21:24] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:28:07] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:29:12] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4033.ulsfo.wmnet
[21:31:45] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:32:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36366 and previous config saved to /var/cache/conftool/dbconfig/20221025-213236-ladsgroup.json
[21:34:28] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:34:54] <wikibugs>	 (PS1) BBlack: Clean up trafficserver::tls and related [puppet] - https://gerrit.wikimedia.org/r/849178
[21:34:56] <wikibugs>	 (PS1) BBlack: Remove cache::(text|upload)_envoy remnants [puppet] - https://gerrit.wikimedia.org/r/849179
[21:34:58] <wikibugs>	 (PS1) BBlack: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180
[21:37:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:37:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:38:36] <wikibugs>	 (CR) BBlack: [C: +2] Clean up outdated commentary on requestctl [puppet] - https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) (owner: BBlack)
[21:38:54] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:38:55] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4033.ulsfo.wmnet
[21:38:59] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4033.ulsfo.wmnet` - cp4033.ulsfo.wmnet (**FAIL**)   - //Host not found on Icinga, unable to downti...
[21:41:55] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[21:42:50] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4035.ulsfo.wmnet
[21:43:06] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:43:20] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4044.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:45:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4009
[21:46:01] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4009
[21:46:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4010
[21:46:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[21:47:11] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4010
[21:47:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36367 and previous config saved to /var/cache/conftool/dbconfig/20221025-214743-ladsgroup.json
[21:50:14] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[21:50:20] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[21:50:50] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([lvs4010.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[21:51:36] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:51:37] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4035.ulsfo.wmnet
[21:51:40] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4035.ulsfo.wmnet` - cp4035.ulsfo.wmnet (**FAIL**)   - //Host not found on Icinga, unable to downti...
[21:53:44] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([lvs4009.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[21:57:39] <wikibugs>	 (CR) Dzahn: [C: +2] "thanks as well" [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[21:57:41] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4044.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:59:04] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[22:00:47] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4051
[22:01:02] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4051
[22:01:02] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:01:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4007 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:01:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:02:18] <wikibugs>	 (CR) Cwhite: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[22:02:23] <wikibugs>	 (PS1) BCornwall: WIP: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[22:02:42] <wikibugs>	 (CR) Cwhite: [C: +1] alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[22:02:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36368 and previous config saved to /var/cache/conftool/dbconfig/20221025-220249-ladsgroup.json
[22:03:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:04:17] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:04:53] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:07:52] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH)
[22:08:36] <wikibugs>	 (CR) BCornwall: [C: +1] "Seems like the surrounding lines in the block are tab characters while, so this brings it in line with the expectation (for the block, at " [puppet] - https://gerrit.wikimedia.org/r/829319 (owner: Zabe)
[22:11:30] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4006 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4033.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:11:46] <wikibugs>	 (CR) Dzahn: "just adding history here. once upon a time all files, including .erb templates, in the puppet repo had tab indentation. Then we switched t" [puppet] - https://gerrit.wikimedia.org/r/829319 (owner: Zabe)
[22:12:00] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS
[22:14:32] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:16:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:17:01] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:17:08] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:17:41] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:17:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:18:12] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:18:16] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:18:46] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:19:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:20:17] <wikibugs>	 (CR) Dzahn: [C: +2] "I can find some logs on logstash by filtering for miscweb* host names, but they are only the puppet runs (puppet: unchanged), I don't see " [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[22:20:17] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:24:31] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:24:40] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:24:42] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:25:24] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4007 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4033.ulsfo.wmnet, cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:25:25] <wikibugs>	 SRE, ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (RobH) Open→Resolved
[22:26:03] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[22:28:28] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4040
[22:28:30] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4040
[22:29:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:30:38] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:32:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[22:32:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4051.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:33:52] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:34:34] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:34:51] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:35:07] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:35:56] <wikibugs>	 SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (BCornwall) Perhaps this is because the severity is set to warning rather than critical?
[22:36:00] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:36:16] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:36:29] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:43:05] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:43:15] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:44:42] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:45:03] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4051.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:45:27] <wikibugs>	 (CR) Cwhite: [C: +1] rsyslog: forward miscweb logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[22:45:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:45:43] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:48:32] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:51:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[22:51:56] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED
[23:07:02] <wikibugs>	 (PS1) Andrew Bogott: Dumps: remove a bunch of references to labstore1006 and labstore1007 [puppet] - https://gerrit.wikimedia.org/r/849192 (https://phabricator.wikimedia.org/T309346)
[23:07:04] <wikibugs>	 (PS1) Andrew Bogott: rsync-via-primary.sh: replace labstore with clouddumps [puppet] - https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346)
[23:07:38] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[23:08:15] <wikibugs>	 (CR) Andrew Bogott: rsync-via-primary.sh: replace labstore with clouddumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) (owner: Andrew Bogott)
[23:10:16] <wikibugs>	 (CR) Dzahn: [C: +2] rsyslog: forward miscweb logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[23:12:32] <wikibugs>	 (CR) Dzahn: [C: +2] rsyslog: forward miscweb logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[23:13:28] <wikibugs>	 (CR) Dzahn: [C: +2] rsyslog: forward miscweb logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[23:17:02] <wikibugs>	 (CR) Dzahn: [C: +2] rsyslog: forward miscweb logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[23:22:45] <wikibugs>	 (CR) Dzahn: [C: +2] "I had to create a saved search first. Think I'm good for now. thanks" [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn)
[23:24:17] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, observability, serviceops-collab: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (Dzahn) Open→Resolved This is resolved. logs are now available here:  https://logstash.wi...
[23:43:46] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (ssingh)
[23:57:21] <wikibugs>	 SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail, Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (Dzahn) @BTullis Yea, that is acceptable. It's still progress over managing the group members...
[23:58:04] <wikibugs>	 (PS2) Ssingh: cp402[2468], cp403[0246]: decommission hosts as part of the ulsfo hardware refresh [puppet] - https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244)
[23:58:37] <wikibugs>	 (CR) CI reject: [V: -1] cp402[2468], cp403[0246]: decommission hosts as part of the ulsfo hardware refresh [puppet] - https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[23:59:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency