[00:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36151 and previous config saved to /var/cache/conftool/dbconfig/20221025-000257-ladsgroup.json [00:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321312)', diff saved to https://phabricator.wikimedia.org/P36152 and previous config saved to /var/cache/conftool/dbconfig/20221025-000904-ladsgroup.json [00:09:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:09:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36153 and previous config saved to /var/cache/conftool/dbconfig/20221025-000931-ladsgroup.json [00:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36154 and previous config saved to /var/cache/conftool/dbconfig/20221025-001057-ladsgroup.json [00:11:33] (PS1) Andrew Bogott: C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - https://gerrit.wikimedia.org/r/848557 [00:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36155 and previous config saved to /var/cache/conftool/dbconfig/20221025-001705-ladsgroup.json [00:18:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P36156 and previous config saved to /var/cache/conftool/dbconfig/20221025-001804-ladsgroup.json [00:18:59] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36157 and previous config saved to /var/cache/conftool/dbconfig/20221025-003211-ladsgroup.json [00:33:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P36158 and previous config saved to /var/cache/conftool/dbconfig/20221025-003310-ladsgroup.json [00:43:26] (CR) Jdlrobson: [C: +1] "🤩🤩🤩" [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [00:45:30] (CR) Jdlrobson: [C: +1] Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [00:47:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36159 and previous config saved to /var/cache/conftool/dbconfig/20221025-004718-ladsgroup.json [00:48:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36160 and previous config saved to /var/cache/conftool/dbconfig/20221025-004817-ladsgroup.json [00:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36161 and previous config saved to /var/cache/conftool/dbconfig/20221025-005332-ladsgroup.json [01:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2169:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36162 and previous config saved to /var/cache/conftool/dbconfig/20221025-010225-ladsgroup.json [01:07:46] (PS6) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [01:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P36163 and previous config saved to /var/cache/conftool/dbconfig/20221025-010839-ladsgroup.json [01:09:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36164 and previous config saved to /var/cache/conftool/dbconfig/20221025-010943-ladsgroup.json [01:09:56] (CR) CI reject: [V: -1] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo) [01:23:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P36165 and previous config saved to /var/cache/conftool/dbconfig/20221025-012345-ladsgroup.json [01:24:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P36166 and previous config saved to /var/cache/conftool/dbconfig/20221025-012449-ladsgroup.json [01:30:30] (PS7) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [01:32:38] (CR) CI reject: [V: -1] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
[puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo) [01:35:44] (PS8) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36167 and previous config saved to /var/cache/conftool/dbconfig/20221025-013852-ladsgroup.json [01:38:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [01:39:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [01:39:17] (CR) Xcollazo: "Incorporated review comments. Please re-review." 
[puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo) [01:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36168 and previous config saved to /var/cache/conftool/dbconfig/20221025-013917-ladsgroup.json [01:39:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P36169 and previous config saved to /var/cache/conftool/dbconfig/20221025-013956-ladsgroup.json [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36170 and previous config saved to /var/cache/conftool/dbconfig/20221025-014536-ladsgroup.json [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36171 and previous config saved to /var/cache/conftool/dbconfig/20221025-015502-ladsgroup.json [01:55:08] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [01:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [01:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36172 and previous config saved to /var/cache/conftool/dbconfig/20221025-015528-ladsgroup.json [01:55:47] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0200) [02:00:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P36173 and previous config saved to /var/cache/conftool/dbconfig/20221025-020043-ladsgroup.json [02:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36174 and previous config saved to /var/cache/conftool/dbconfig/20221025-020150-ladsgroup.json [02:03:59] (PS1) Raymond Ndibe: p::toolforge:harbor::prepare: upgrade harbor to v2.5.4 [puppet] - https://gerrit.wikimedia.org/r/848602 (https://phabricator.wikimedia.org/T316530) [02:04:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:05:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:05:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:05:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:40] (PS1) 
TrainBranchBot: Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512) [02:07:44] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P36175 and previous config saved to /var/cache/conftool/dbconfig/20221025-021550-ladsgroup.json [02:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P36176 and previous config saved to /var/cache/conftool/dbconfig/20221025-021656-ladsgroup.json [02:20:21] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:21:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:24:09] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.7 [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/848095 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [02:24:29] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:25:11] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36177 and previous config saved to /var/cache/conftool/dbconfig/20221025-023056-ladsgroup.json [02:31:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [02:31:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [02:31:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36178 and previous config saved to /var/cache/conftool/dbconfig/20221025-023120-ladsgroup.json [02:31:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P36179 and previous config saved to /var/cache/conftool/dbconfig/20221025-023203-ladsgroup.json [02:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:32:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:32:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36180 and previous config saved to /var/cache/conftool/dbconfig/20221025-023733-ladsgroup.json [02:40:09] (CR) 
Cwhite: [C: +1] dispatch: update to latest upstream [docker-images/production-images] - https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [02:47:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321312)', diff saved to https://phabricator.wikimedia.org/P36181 and previous config saved to /var/cache/conftool/dbconfig/20221025-024709-ladsgroup.json [02:47:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:49:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:50:55] (CR) Cwhite: miscweb: add rsyslog::input::files to send apache logs to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn) [02:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P36182 and previous config saved to /var/cache/conftool/dbconfig/20221025-025239-ladsgroup.json [02:56:49] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0300) [03:00:45] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:12] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 
(https://phabricator.wikimedia.org/T320512) [03:01:14] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [03:01:56] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/848627 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [03:02:24] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.7 refs T320512 [03:02:29] T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512 [03:03:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:03:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:03:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:04:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P36183 and previous config saved to /var/cache/conftool/dbconfig/20221025-030745-ladsgroup.json [03:09:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:10:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:10:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:11:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:17:17] (PS1) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:17:51] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - 
https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:19:34] (PS2) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:20:08] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:20:39] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2022-10-25 03:08:31 is 899 MiB, but the previous one was 1.2 GiB, a change of -25.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:20:57] (PS3) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:21:31] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321312)', diff saved to https://phabricator.wikimedia.org/P36184 and previous config saved to /var/cache/conftool/dbconfig/20221025-032252-ladsgroup.json [03:22:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [03:23:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [03:23:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36185 and previous config saved to /var/cache/conftool/dbconfig/20221025-032316-ladsgroup.json [03:29:02] (PS4) 
Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:29:36] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:30:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36186 and previous config saved to /var/cache/conftool/dbconfig/20221025-033039-ladsgroup.json [03:34:18] (PS5) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:34:52] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dirs [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:38:18] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.7 refs T320512 (duration: 35m 54s) [03:38:23] T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512 [03:39:19] (PS6) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) [03:39:53] (CR) CI reject: [V: -1] profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:40:16] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.5 (duration: 01m 56s) [03:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:42:24] (PS7) Andrew Bogott: profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 
(https://phabricator.wikimedia.org/T321514) [03:45:09] (CR) Andrew Bogott: "I didn't have much luck enumerating all the mgrs but this seems to work for the one that counts ($hostname)" [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [03:45:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P36187 and previous config saved to /var/cache/conftool/dbconfig/20221025-034546-ladsgroup.json [03:48:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:48:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:54:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:57:15] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:59:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:00:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P36188 and previous config saved to /var/cache/conftool/dbconfig/20221025-040052-ladsgroup.json [04:01:19] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:03:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:03:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:15:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321312)', diff saved to https://phabricator.wikimedia.org/P36189 and previous config saved to 
/var/cache/conftool/dbconfig/20221025-041558-ladsgroup.json [04:18:01] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:32:13] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:12:55] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:18:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T321177 [05:19:02] T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177 [05:19:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T321177 [05:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1138 with weight 0 T321177', diff saved to https://phabricator.wikimedia.org/P36190 and previous config saved to /var/cache/conftool/dbconfig/20221025-051933-ladsgroup.json [05:24:21] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:25:43] <_joe_> !log restarting pybal on lvs1020 to test cookbook mechanism [05:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:44:17] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: 
an-worker1084, analytics1074, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, relforge1003, relforge1004, stat1005, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_runs [05:47:17] (PS2) Ladsgroup: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot) [05:47:23] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/844015 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot) [05:47:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:56:18] <_joe_> !log restarting pybal again on lvs1020, again for testing [05:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0600). 
[06:00:10] o/ [06:00:14] let's go [06:00:17] o/ [06:00:33] !log Starting s4 eqiad failover from db1160 to db1138 - T321177 [06:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:38] T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177 [06:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T321177', diff saved to https://phabricator.wikimedia.org/P36191 and previous config saved to /var/cache/conftool/dbconfig/20221025-060043-ladsgroup.json [06:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T321177', diff saved to https://phabricator.wikimedia.org/P36192 and previous config saved to /var/cache/conftool/dbconfig/20221025-060118-ladsgroup.json [06:02:44] it should be mostly done [06:04:25] (PS2) Ladsgroup: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot) [06:05:06] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s4-master alias (1 comment) [dns] - https://gerrit.wikimedia.org/r/844016 (https://phabricator.wikimedia.org/T321177) (owner: Gerrit maintenance bot) [06:06:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1160 T321177', diff saved to https://phabricator.wikimedia.org/P36193 and previous config saved to /var/cache/conftool/dbconfig/20221025-060643-ladsgroup.json [06:06:49] T321177: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T321177 [06:09:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:09:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:13:51] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:14:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:14:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:15:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:15:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36194 and previous config saved to /var/cache/conftool/dbconfig/20221025-061552-ladsgroup.json [06:16:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [06:16:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [06:16:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36195 and previous config saved to /var/cache/conftool/dbconfig/20221025-061621-ladsgroup.json [06:17:09] (PS1) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 [06:17:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36196 and previous config saved to /var/cache/conftool/dbconfig/20221025-061710-ladsgroup.json [06:20:24] (CR) CI reject: [V: -1] Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto) [06:20:39] PROBLEM - Ensure hosts are not 
performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: an-worker1084, analytics1074, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, phab1004, releases1002, releases2002, relforge1003, relforge1004, stat1005, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_ru [06:20:39] s [06:23:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36198 and previous config saved to /var/cache/conftool/dbconfig/20221025-062318-ladsgroup.json [06:23:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36199 and previous config saved to /var/cache/conftool/dbconfig/20221025-062337-ladsgroup.json [06:25:17] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:12] (PS1) Muehlenhoff: Removed kerberos principal for bscarone [puppet] - https://gerrit.wikimedia.org/r/849007 [06:32:53] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 799 hosts [06:33:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 799 hosts [06:33:29] (CR) Muehlenhoff: [C: +2] Removed kerberos principal for bscarone [puppet] - https://gerrit.wikimedia.org/r/849007 (owner: Muehlenhoff) [06:33:42] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bruno Scarone out of all services on: 1206 hosts [06:34:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bruno Scarone out of all services on: 1206 hosts [06:36:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 
7795 [06:38:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36200 and previous config saved to /var/cache/conftool/dbconfig/20221025-063824-ladsgroup.json [06:38:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7795 [06:38:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36201 and previous config saved to /var/cache/conftool/dbconfig/20221025-063843-ladsgroup.json [06:42:17] (PS1) Muehlenhoff: Make ganeti4005 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/849010 (https://phabricator.wikimedia.org/T317247) [06:49:13] (PS2) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 [06:50:49] (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: further improvements for logging. 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/848241 (https://phabricator.wikimedia.org/T301757) (owner: Giuseppe Lavagetto) [06:51:55] (CR) Muehlenhoff: [C: +2] Make ganeti4005 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/849010 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff) [06:52:18] (CR) Elukey: [C: +2] admin_ng: update Istio settings for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/848344 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [06:52:29] (CR) CI reject: [V: -1] Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto) [06:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P36202 and previous config saved to /var/cache/conftool/dbconfig/20221025-065330-ladsgroup.json [06:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36203 and previous config saved to /var/cache/conftool/dbconfig/20221025-065350-ladsgroup.json [06:55:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [06:56:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [06:56:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [06:56:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [06:57:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [06:58:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[06:58:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [06:58:48] (PS1) Marostegui: db1202: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/849012 [06:59:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:59:29] (CR) Marostegui: [C: +2] db1202: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/849012 (owner: Marostegui) [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0700). nyaa~ [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36204 and previous config saved to /var/cache/conftool/dbconfig/20221025-070004-root.json [07:00:20] SRE, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (Marostegui) I am repooling this host now. 
[07:04:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:04:09] (CR) Filippo Giunchedi: [C: +2] dispatch: update to latest upstream [docker-images/production-images] - https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [07:04:13] (CR) Filippo Giunchedi: [V: +2 C: +2] dispatch: update to latest upstream [docker-images/production-images] - https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [07:05:22] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:28] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [07:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36205 and previous config saved to /var/cache/conftool/dbconfig/20221025-070837-ladsgroup.json [07:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321312)', diff saved to https://phabricator.wikimedia.org/P36206 and previous config saved to /var/cache/conftool/dbconfig/20221025-070856-ladsgroup.json [07:09:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [07:09:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [07:09:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36207 and previous config saved to /var/cache/conftool/dbconfig/20221025-070922-ladsgroup.json [07:09:38] (CR) Filippo Giunchedi: [C: +1] logstash: add sanitize filter [puppet] - https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: Cwhite) [07:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36208 and previous config saved to /var/cache/conftool/dbconfig/20221025-071053-ladsgroup.json [07:15:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36209 and previous config saved to /var/cache/conftool/dbconfig/20221025-071509-root.json [07:15:32] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 
3 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37710/console" [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [07:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36210 and previous config saved to /var/cache/conftool/dbconfig/20221025-071652-ladsgroup.json [07:17:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [07:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P36211 and previous config saved to /var/cache/conftool/dbconfig/20221025-072600-ladsgroup.json [07:26:41] (CR) David Caro: [V: +1 C: +2] profile::ceph::mon: explicitly create mgr keyring dir [puppet] - https://gerrit.wikimedia.org/r/848632 (https://phabricator.wikimedia.org/T321514) (owner: Andrew Bogott) [07:27:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [07:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36212 and previous config saved to /var/cache/conftool/dbconfig/20221025-073014-root.json [07:31:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [07:31:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [07:31:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36213 and previous config saved to /var/cache/conftool/dbconfig/20221025-073159-ladsgroup.json [07:38:21] !log installing 5.10.149-2 update on bullseye hosts 
(regression doesn't concern any of our servers, but still makes sense to have further reboots move to the latest kernel) [07:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P36215 and previous config saved to /var/cache/conftool/dbconfig/20221025-074106-ladsgroup.json [07:44:37] (CR) Jbond: [V: +1 C: +2] C:swift::storage: drop unused udev rule [puppet] - https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: Jbond) [07:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36216 and previous config saved to /var/cache/conftool/dbconfig/20221025-074519-root.json [07:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36217 and previous config saved to /var/cache/conftool/dbconfig/20221025-074705-ladsgroup.json [07:48:22] (PS1) Jbond: C:swift::storage: drop absented resource [puppet] - https://gerrit.wikimedia.org/r/849013 (https://phabricator.wikimedia.org/T308677) [07:51:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:53] (CR) Muehlenhoff: "I'd say we should simply create sub team-specific roles? 
Such as role::insetup::infrastructure_foundations, role::insetup::data_persistenc" [puppet] - https://gerrit.wikimedia.org/r/845519 (owner: Jbond) [07:56:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36218 and previous config saved to /var/cache/conftool/dbconfig/20221025-075613-ladsgroup.json [07:56:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:56:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36219 and previous config saved to /var/cache/conftool/dbconfig/20221025-075638-ladsgroup.json [07:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36220 and previous config saved to /var/cache/conftool/dbconfig/20221025-075657-ladsgroup.json [07:59:00] (PS1) Jbond: P:cumin::master: drop low-traffic from PoP sites [puppet] - https://gerrit.wikimedia.org/r/849014 [07:59:18] (PS1) Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) [08:00:04] jnuche and hashar: gettimeofday() says it's time for MediaWiki train - Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T0800) [08:00:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37711/console" [puppet] - https://gerrit.wikimedia.org/r/849014 (owner: Jbond) [08:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36221 and previous config saved to /var/cache/conftool/dbconfig/20221025-080024-root.json [08:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36222 and previous config saved to /var/cache/conftool/dbconfig/20221025-080153-ladsgroup.json [08:02:09] !log drain ganeti1023 for eventual reimage T311687 [08:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321312)', diff saved to https://phabricator.wikimedia.org/P36223 and previous config saved to /var/cache/conftool/dbconfig/20221025-080212-ladsgroup.json [08:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:14] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [08:02:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [08:02:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [08:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36224 and previous config saved to /var/cache/conftool/dbconfig/20221025-080238-ladsgroup.json [08:03:15] (PS2) Jbond: P:cumin::master: drop low-traffic from PoP sites [puppet] - https://gerrit.wikimedia.org/r/849014 [08:03:43] (PS2) Elukey: coredns: support up to 
upstream version 1.8.7 [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) [08:04:15] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37712/console" [puppet] - https://gerrit.wikimedia.org/r/849014 (owner: Jbond) [08:07:04] (PS3) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 [08:07:08] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512) [08:07:10] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [08:07:57] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/849017 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot) [08:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:10:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36225 and previous config saved to /var/cache/conftool/dbconfig/20221025-081007-ladsgroup.json [08:10:16] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [08:10:24] (CR) Volans: [C: +1] "LGTM, thanks for the fix" [puppet] - https://gerrit.wikimedia.org/r/849014 (owner: Jbond) [08:10:46] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [08:11:42] (PS3) Elukey: coredns: support up to 
upstream version 1.8.7 [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) [08:12:10] (CR) Jbond: [V: +1 C: +2] P:cumin::master: drop low-traffic from PoP sites [puppet] - https://gerrit.wikimedia.org/r/849014 (owner: Jbond) [08:12:23] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.7 refs T320512 [08:12:28] T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512 [08:12:39] SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (MatthewVernon) >>! In T308677#8339622, @jbond wrote: >> luckily puppet doesn'... [08:13:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:13:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT leases) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:14:14] (CR) Jelto: ""Job succeeded" - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/jobs/27102" [puppet] - https://gerrit.wikimedia.org/r/848186 (owner: David Caro) [08:14:26] (PS4) Elukey: coredns: support up to upstream version 1.8.7 [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) [08:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36226 and previous config saved to /var/cache/conftool/dbconfig/20221025-081529-root.json [08:15:42] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36227 and previous config saved to /var/cache/conftool/dbconfig/20221025-081700-ladsgroup.json [08:17:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:17:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:17:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [08:18:02] (CR) Elukey: "I checked differences with https://github.com/coredns/helm, there are some but mostly related to:" [deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: Elukey) [08:18:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:19:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [08:20:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [08:21:10] (CR) Elukey: "Helm lint currently adds the new option for endpointslices anyway, I am wondering if KubeVersion is not what we expect in CI." 
[deployment-charts] - https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: Elukey) [08:24:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:25:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36228 and previous config saved to /var/cache/conftool/dbconfig/20221025-082514-ladsgroup.json [08:26:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [08:26:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [08:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:29:47] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1065.eqiad.wmnet [08:30:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36229 and previous config saved to /var/cache/conftool/dbconfig/20221025-083034-root.json [08:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P36230 and previous config saved to /var/cache/conftool/dbconfig/20221025-083206-ladsgroup.json [08:33:04] SRE, Data-Engineering-Operations, 
Data-Engineering-Planning, Mail, Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) So far, all of the inbound messages to this mailing list have been held for moderat... [08:36:45] (CR) Volans: "Nice to see a new cookbook! I've left some comments inline." [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto) [08:36:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [08:40:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36232 and previous config saved to /var/cache/conftool/dbconfig/20221025-084020-ladsgroup.json [08:41:44] (PS1) Jbond: insetup: add team specific insetup roles to ease ownership identification [puppet] - https://gerrit.wikimedia.org/r/849020 [08:42:19] (CR) CI reject: [V: -1] insetup: add team specific insetup roles to ease ownership identification [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond) [08:42:23] (CR) Jbond: [C: +2] O:insetup: drop role contact I/F (1 comment) [puppet] - https://gerrit.wikimedia.org/r/845519 (owner: Jbond) [08:44:34] (PS7) Filippo Giunchedi: dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [08:44:36] (PS1) Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) [08:45:09] (CR) CI reject: [V: -1] dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [08:45:22] (CR) Arturo Borrero Gonzalez: [C: +1] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - https://gerrit.wikimedia.org/r/845063 
(https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott) [08:45:30] (CR) Jbond: [C: +2] C:swift::storage: drop absented resource [puppet] - https://gerrit.wikimedia.org/r/849013 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [08:45:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36233 and previous config saved to /var/cache/conftool/dbconfig/20221025-084541-root.json [08:46:12] (CR) Arturo Borrero Gonzalez: [C: +1] OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott) [08:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36234 and previous config saved to /var/cache/conftool/dbconfig/20221025-084713-ladsgroup.json [08:48:33] (CR) Arturo Borrero Gonzalez: [C: +1] "After merging this you need:" [puppet] - https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: FNegri) [08:48:43] (PS8) Filippo Giunchedi: dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [08:48:45] (PS2) Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) [08:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36235 and previous config saved to /var/cache/conftool/dbconfig/20221025-084929-ladsgroup.json [08:49:44] (CR) CI reject: [V: -1] dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [08:51:22] 
(PS5) FNegri: Add Tekton deb repository [puppet] - https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [08:54:23] (PS9) Filippo Giunchedi: dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [08:54:25] (PS3) Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) [08:55:02] (CR) FNegri: [C: +2] Add Tekton deb repository [puppet] - https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: FNegri) [08:55:18] (PS1) Muehlenhoff: Swap ganeti4003 with ganeti4005 for blackbox smoke tests [puppet] - https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) [08:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321312)', diff saved to https://phabricator.wikimedia.org/P36236 and previous config saved to /var/cache/conftool/dbconfig/20221025-085527-ladsgroup.json [08:55:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [08:55:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [08:55:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [08:55:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [08:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36237 and previous config saved to /var/cache/conftool/dbconfig/20221025-085558-ladsgroup.json [08:57:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ms-be1066.eqiad.wmnet [08:57:35] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [08:58:42] (PS1) Jbond: site.pp: move insetup hosts to the team specific role [puppet] - https://gerrit.wikimedia.org/r/849024 [09:00:08] (PS1) Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) [09:01:03] (CR) Filippo Giunchedi: "LGTM, deploy is only puppet-merge, no further action needed" [puppet] - https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff) [09:01:06] (CR) Filippo Giunchedi: [C: +1] Swap ganeti4003 with ganeti4005 for blackbox smoke tests [puppet] - https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff) [09:01:08] (PS1) Jbond: insetup_noferm: add traffic as the owner of this role [puppet] - https://gerrit.wikimedia.org/r/849026 [09:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36238 and previous config saved to /var/cache/conftool/dbconfig/20221025-090213-ladsgroup.json [09:02:36] (CR) Muehlenhoff: site.pp: move insetup hosts to the team specific role (4 comments) [puppet] - https://gerrit.wikimedia.org/r/849024 (owner: Jbond) [09:04:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P36239 and previous config saved to /var/cache/conftool/dbconfig/20221025-090436-ladsgroup.json [09:06:30] (CR) Jbond: [V: +2] "I think its best to override CI on this one. 
I think it makes more sense to just use the system::role that comes from role::insetup" [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond) [09:10:32] (PS2) Jbond: insetup: add team specific insetup roles to ease ownership identification [puppet] - https://gerrit.wikimedia.org/r/849020 [09:11:07] (CR) CI reject: [V: -1] insetup: add team specific insetup roles to ease ownership identification [puppet] - https://gerrit.wikimedia.org/r/849020 (owner: Jbond) [09:11:55] (PS1) Jcrespo: mariadb: Add production-side filters for CampaignEvents extension tables [puppet] - https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) [09:13:48] (CR) MVernon: "My slight concern with this is that these two jobs currently produce a ping on #wikimedia-data-persistence on IRC; with this change, I thi" [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi) [09:13:51] (CR) Btullis: analytics: move kerberos::systemd_timer and deps to send_mail param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi) [09:14:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [09:14:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [09:15:40] (CR) Jcrespo: "Based on https://phabricator.wikimedia.org/P35370 there are no fully private tables to add to manifests/realm.pp but please check." 
[puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo) [09:16:10] (03CR) 10MVernon: Use generic 'Check systemd state' alert to catch timer failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:17:16] (03CR) 10Giuseppe Lavagetto: Add cookbook to restart pybal (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [09:17:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36240 and previous config saved to /var/cache/conftool/dbconfig/20221025-091720-ladsgroup.json [09:17:56] (03PS4) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 [09:19:27] (03PS2) 10Muehlenhoff: Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) [09:19:39] (03PS1) 10Volans: CORE_DATACENTERS: use the wmflib constant [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 [09:19:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P36241 and previous config saved to /var/cache/conftool/dbconfig/20221025-091942-ladsgroup.json [09:21:59] (03PS5) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 [09:22:45] (03CR) 10Filippo Giunchedi: [C: 03+2] analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:22:49] (03PS1) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) [09:23:17] 
(03CR) 10Filippo Giunchedi: [C: 03+1] Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff) [09:23:44] (03PS2) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) [09:24:12] (03CR) 10Filippo Giunchedi: Use generic 'Check systemd state' alert to catch timer failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:25:20] (03PS1) 10FNegri: Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143) [09:25:37] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37713/console" [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro) [09:25:54] (03CR) 10CI reject: [V: 04-1] p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro) [09:26:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [09:26:38] (03PS3) 10David Caro: p::ceph:mon: set permissions if mgr key parents [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) [09:26:47] (03CR) 10David Caro: [V: 03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro) [09:27:12] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [09:27:12] (03CR) 10Jbond: [C: 03+1] 
"lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/849031 (owner: 10Volans) [09:28:55] (03CR) 10Muehlenhoff: [C: 03+2] Swap ganeti4002/ganeti4003 for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/849023 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff) [09:28:59] (03PS4) 10David Caro: p::ceph:mon: set permissions if mgr key parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) [09:29:05] (03CR) 10Volans: "Replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [09:30:00] (03CR) 10Marostegui: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo) [09:30:34] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) @Dzahn - would it be acceptable for us to use the exim aliases file to forward the... 
[09:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36243 and previous config saved to /var/cache/conftool/dbconfig/20221025-093226-ladsgroup.json [09:32:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36244 and previous config saved to /var/cache/conftool/dbconfig/20221025-093449-ladsgroup.json [09:34:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [09:35:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [09:35:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36245 and previous config saved to /var/cache/conftool/dbconfig/20221025-093513-ladsgroup.json [09:36:09] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond) [09:36:11] (03PS1) 10Volans: tox.ini: explain why there are old Python versions [cookbooks] - 10https://gerrit.wikimedia.org/r/849034 (https://phabricator.wikimedia.org/T289222) [09:36:26] (03PS2) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024 [09:36:43] !log drain ganeti4002 for eventual decom T317247 [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:48] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [09:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36246 and previous config saved to /var/cache/conftool/dbconfig/20221025-094122-ladsgroup.json [09:44:16] (03CR) 10Muehlenhoff: "Two more, missed them before." [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond) [09:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36247 and previous config saved to /var/cache/conftool/dbconfig/20221025-094733-ladsgroup.json [09:47:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:47:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:48:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36248 and previous config saved to /var/cache/conftool/dbconfig/20221025-094800-ladsgroup.json [09:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36249 and previous config saved to /var/cache/conftool/dbconfig/20221025-094921-ladsgroup.json [09:51:42] (03CR) 10JMeybohm: coredns: support up to upstream version 1.8.7 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [09:51:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36250 and 
previous config saved to /var/cache/conftool/dbconfig/20221025-095527-ladsgroup.json [09:55:31] (03CR) 10Jbond: "thanks done" [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond) [09:55:46] (03PS3) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024 [09:56:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P36251 and previous config saved to /var/cache/conftool/dbconfig/20221025-095629-ladsgroup.json [09:57:02] (03CR) 10JMeybohm: [C: 03+1] coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [09:57:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [09:57:35] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:58:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond) [09:58:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/849026 (owner: 10Jbond) [09:59:02] (03CR) 10Jbond: [V: 03+2] insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [09:59:14] (03PS4) 10Jbond: site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024 [09:59:21] (03PS2) 10Jbond: insetup_noferm: add traffic as the owner of this role [puppet] - 10https://gerrit.wikimedia.org/r/849026 [10:00:29] (03PS1) 10Marostegui: wmnet: Failover all m*-master [dns] - 10https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312) [10:02:46] (03PS1) 10Elukey: admin_ng: add a Istio vs and retry settings on ml-serve for eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/849040 (https://phabricator.wikimedia.org/T320374) [10:03:37] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:37] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [10:03:43] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849026 (owner: 10Jbond) [10:05:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:06:45] (03CR) 10Volans: [C: 04-1] "I don't mind the addition of the new roles, it makes sense to me although a bit verbose. 
Just make sure that DCOps is onboard with it as t" [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [10:07:12] (03CR) 10Vgutierrez: Add cookbook to restart pybal (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [10:07:16] (03CR) 10FNegri: [C: 03+2] Fix reprepro config for thirdparty/tekton [puppet] - 10https://gerrit.wikimedia.org/r/849033 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [10:07:36] (03CR) 10Marostegui: [C: 03+1] "Amir, this requires sanitarium puppet runs + mariadb restarts." [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza) [10:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36252 and previous config saved to /var/cache/conftool/dbconfig/20221025-101034-ladsgroup.json [10:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P36253 and previous config saved to /var/cache/conftool/dbconfig/20221025-101135-ladsgroup.json [10:11:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:39] (03CR) 10Volans: [C: 03+1] insetup: add team specific insetup roles to ease ownership identification (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [10:22:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P36254 and previous config saved to /var/cache/conftool/dbconfig/20221025-102540-ladsgroup.json [10:26:15] (03CR) 
10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [10:26:31] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321312)', diff saved to https://phabricator.wikimedia.org/P36255 and previous config saved to /var/cache/conftool/dbconfig/20221025-102642-ladsgroup.json [10:26:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:27:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:27:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:27:03] (03CR) 10Muehlenhoff: [C: 03+1] insetup: add team specific insetup roles to ease ownership identification (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [10:27:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:27:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36256 and previous config saved to /var/cache/conftool/dbconfig/20221025-102724-ladsgroup.json [10:28:37] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:44] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10Vgutierrez) [10:29:59] 10SRE, 10Traffic: PyBalBGPUnstable didn't 
report T321545 - https://phabricator.wikimedia.org/T321547 (10Vgutierrez) p:05Triage→03Medium [10:30:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:35] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10Vgutierrez) [10:31:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [10:32:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [10:32:51] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover all m*-master [dns] - 10https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312) (owner: 10Marostegui) [10:33:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 2.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36257 and previous config saved to /var/cache/conftool/dbconfig/20221025-103346-ladsgroup.json [10:40:06] (03CR) 10Daimona Eaytoy: [C: 03+1] "Yup, LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849029 (https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo) [10:40:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:40:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P36258 and previous config saved to /var/cache/conftool/dbconfig/20221025-104047-ladsgroup.json [10:40:53] (03PS1) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [10:41:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [10:41:25] (03CR) 10CI reject: [V: 04-1] Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [10:41:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:32] (03PS2) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [10:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36259 and previous config saved to /var/cache/conftool/dbconfig/20221025-104303-ladsgroup.json [10:43:06] (03CR) 10CI reject: [V: 04-1] Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [10:43:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37714/console" [puppet] - 10https://gerrit.wikimedia.org/r/841908 
(owner: 10JMeybohm) [10:43:30] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - 10https://gerrit.wikimedia.org/r/841908 (owner: 10JMeybohm) [10:43:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [10:48:29] (03PS3) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [10:48:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P36260 and previous config saved to /var/cache/conftool/dbconfig/20221025-104852-ladsgroup.json [10:49:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:06] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37715/console" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [10:51:58] 10SRE, 10Observability-Metrics, 10serviceops, 10Maps (Kartotherian), 10Patch-For-Review: Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10hnowlan) [10:53:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [10:54:12] (03PS4) 10FNegri: Add new tekton package to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [10:55:11] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37716/console" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [10:58:08] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile 
domains - https://phabricator.wikimedia.org/T152882 (10Zabe) [10:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36261 and previous config saved to /var/cache/conftool/dbconfig/20221025-105810-ladsgroup.json [11:03:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:04:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P36262 and previous config saved to /var/cache/conftool/dbconfig/20221025-110359-ladsgroup.json [11:11:39] (03PS1) 10Jbond: aptrepo: create a component to backport python3.9 to unblock CI [puppet] - 10https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222) [11:12:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36263 and previous config saved to /var/cache/conftool/dbconfig/20221025-111316-ladsgroup.json [11:16:37] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Ladsgroup) What you did for accepting non-members looks good to me. I haven't seen any held... 
[11:19:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321312)', diff saved to https://phabricator.wikimedia.org/P36264 and previous config saved to /var/cache/conftool/dbconfig/20221025-111906-ladsgroup.json [11:19:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:19:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:19:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36265 and previous config saved to /var/cache/conftool/dbconfig/20221025-111930-ladsgroup.json [11:19:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover all m*-master [dns] - 10https://gerrit.wikimedia.org/r/849039 (https://phabricator.wikimedia.org/T321312) (owner: 10Marostegui) [11:22:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222) (owner: 10Jbond) [11:23:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] insetup: add team specific insetup roles to ease ownership identification [puppet] - 10https://gerrit.wikimedia.org/r/849020 (owner: 10Jbond) [11:23:32] (03CR) 10Jbond: [C: 03+2] site.pp: move insetup hosts to the team specific role [puppet] - 10https://gerrit.wikimedia.org/r/849024 (owner: 10Jbond) [11:23:36] (03CR) 10Jbond: [C: 03+2] insetup_noferm: add traffic as the owner of this role [puppet] - 10https://gerrit.wikimedia.org/r/849026 (owner: 10Jbond) [11:24:31] (03CR) 10Arturo Borrero Gonzalez: Add new tekton package to WMCS bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [11:25:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36266 and previous config saved to /var/cache/conftool/dbconfig/20221025-112527-ladsgroup.json [11:25:36] (03CR) 10Jbond: [C: 03+2] aptrepo: create a component to backport python3.9 to unblock CI [puppet] - 10https://gerrit.wikimedia.org/r/849049 (https://phabricator.wikimedia.org/T289222) (owner: 10Jbond) [11:28:17] (03PS1) 10Arturo Borrero Gonzalez: realm.pp: introduce $::wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849050 [11:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36267 and previous config saved to /var/cache/conftool/dbconfig/20221025-112822-ladsgroup.json [11:28:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [11:28:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Add production-side filters for CampaignEvents extension tables [puppet] - 10https://gerrit.wikimedia.org/r/849029 
(https://phabricator.wikimedia.org/T318595) (owner: 10Jcrespo) [11:28:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [11:28:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36268 and previous config saved to /var/cache/conftool/dbconfig/20221025-112848-ladsgroup.json [11:29:17] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37717/console" [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [11:33:29] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [11:34:12] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org [11:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36269 and previous config saved to /var/cache/conftool/dbconfig/20221025-113455-ladsgroup.json [11:35:22] (03PS2) 10Ladsgroup: Add growthexperiments_user_impact to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza) [11:35:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add growthexperiments_user_impact to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza) [11:37:26] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - 10https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) [11:37:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [11:37:51] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - 
10https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) [11:38:08] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2002.wikimedia.org [11:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P36270 and previous config saved to /var/cache/conftool/dbconfig/20221025-114034-ladsgroup.json [11:41:39] (03CR) 10Jelto: [V: 03+1 C: 03+2] docker_registry_ha: Require JWT to have ref_protected claim set to true [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [11:41:52] PROBLEM - IPMI Sensor Status on kafka-logging1004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:43:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual decom [11:43:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual decom [11:46:08] (03PS1) 10Muehlenhoff: Remove ganeti4002 from Puppet for decom [puppet] - 10https://gerrit.wikimedia.org/r/849054 (https://phabricator.wikimedia.org/T317247) [11:49:05] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti4002 from Puppet for decom [puppet] - 10https://gerrit.wikimedia.org/r/849054 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff) [11:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36271 and previous config saved to /var/cache/conftool/dbconfig/20221025-115002-ladsgroup.json [11:54:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 132203 [11:55:20] 
!log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4002.ulsfo.wmnet [11:55:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P36272 and previous config saved to /var/cache/conftool/dbconfig/20221025-115540-ladsgroup.json [11:57:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 132203 [11:59:57] (03CR) 10David Caro: "we should not be installing tekton-cli directly, if needed, it should be pulled by toolforge-cli, so we will want to configure the reposit" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:00:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36273 and previous config saved to /var/cache/conftool/dbconfig/20221025-120509-ladsgroup.json [12:07:00] (03CR) 10Elukey: [V: 03+2 C: 03+2] coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [12:09:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4002.ulsfo.wmnet [12:09:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti4002.ulsfo.wmnet` - ganeti4002.ulsfo.wmnet (**PASS**)... [12:10:37] 10SRE, 10Wikimedia-Mailing-lists: "The FOO list has N moderation requests waiting." 
notifications can't be turned off in Mailman 3 - https://phabricator.wikimedia.org/T284107 (10GreenReaper) Even if you only get one spam email a day, it basically means you are quite likely to get two emails about it rather tha... [12:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321312)', diff saved to https://phabricator.wikimedia.org/P36274 and previous config saved to /var/cache/conftool/dbconfig/20221025-121047-ladsgroup.json [12:10:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:11:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:11:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36275 and previous config saved to /var/cache/conftool/dbconfig/20221025-121111-ladsgroup.json [12:16:31] (03CR) 10Elukey: [C: 03+2] admin_ng: add a Istio vs and retry settings on ml-serve for eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/849040 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [12:16:42] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10MoritzMuehlenhoff) I have setup ganeti4005 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected. @RobH : I've als... [12:17:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36276 and previous config saved to /var/cache/conftool/dbconfig/20221025-121730-ladsgroup.json [12:18:24] I restarted mailman services, let's see if it fixes [12:19:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:19:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36277 and previous config saved to /var/cache/conftool/dbconfig/20221025-122015-ladsgroup.json [12:22:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2519 [12:23:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2519 [12:23:35] (03PS1) 10Muehlenhoff: Add profile::contacts::role_contacts for turnilo/staging [puppet] - 10https://gerrit.wikimedia.org/r/849059 [12:26:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:26:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:27:11] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/849061 [12:28:20] (03PS6) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 [12:29:04] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for gitlab runners 
[puppet] - 10https://gerrit.wikimedia.org/r/849063 [12:29:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [12:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [12:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36278 and previous config saved to /var/cache/conftool/dbconfig/20221025-123001-ladsgroup.json [12:30:40] (03PS1) 10Ladsgroup: Add add_el_to_domain_index_T318605.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605) [12:31:49] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [12:32:04] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: virgin is 27 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [12:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P36279 and previous config saved to /var/cache/conftool/dbconfig/20221025-123236-ladsgroup.json [12:33:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1023.eqiad.wmnet with reason: Remove from cluster for eventual reimage [12:33:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1023.eqiad.wmnet with reason: Remove from cluster for eventual reimage [12:36:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36280 and previous config saved to /var/cache/conftool/dbconfig/20221025-123615-ladsgroup.json 
[12:36:44] (03PS5) 10FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [12:37:12] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [12:37:26] (03CR) 10CI reject: [V: 04-1] Add thirdparty/tekton repo to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:38:00] 10SRE, 10Traffic, 10observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (10fgiunchedi) I can't reproduce the issue ATM via https://grafana.wikimedia.org/goto/14wkdONVz?orgId=1 however your intuition is correct: the interval for... [12:38:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1023.eqiad.wmnet with OS bullseye [12:38:51] !log Restarting CI Jenkins [12:38:53] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bullseye [12:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:39] !log drain ganeti1015 for eventual reimage T311687 [12:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:47] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [12:42:51] oh thanks systemd for killing jenkins grr [12:44:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:44:49] (03PS1) 10Filippo Giunchedi: timer::job: remove monitoring_enabled [puppet] - 
https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253) [12:45:14] (PS1) Clément Goubert: aptrepo: add component thirdparty/otelcol-contrib [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) [12:46:30] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37721/console" [puppet] - https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) (owner: Clément Goubert) [12:46:50] (PS6) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [12:47:24] (Abandoned) Andrew Bogott: C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - https://gerrit.wikimedia.org/r/848557 (owner: Andrew Bogott) [12:47:35] (CR) CI reject: [V: -1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri) [12:47:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P36281 and previous config saved to /var/cache/conftool/dbconfig/20221025-124743-ladsgroup.json [12:51:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.579 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36283 and previous config saved to /var/cache/conftool/dbconfig/20221025-125122-ladsgroup.json [12:51:59] (PS7) FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [12:52:54] (CR) FNegri: [V: +1] "PCC SUCCESS (DIFF
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37724/console" [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri) [12:52:58] (CR) CI reject: [V: -1] Add thirdparty/tekton repo to WMCS bastions [puppet] - https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: FNegri) [12:53:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage [12:53:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage [12:58:42] SRE, Traffic, observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (Vgutierrez) oh got it, thanks @fgiunchedi @BCornwall please update the min step to 2m in the dashboard.. maybe adding a hidden variable and referencing... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1300). [13:00:05] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1300) [13:00:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:27] o/ [13:00:57] (03PS8) 10FNegri: Add thirdparty/tekton repo to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) [13:02:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321312)', diff saved to https://phabricator.wikimedia.org/P36284 and previous config saved to /var/cache/conftool/dbconfig/20221025-130249-ladsgroup.json [13:02:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:03:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36285 and previous config saved to /var/cache/conftool/dbconfig/20221025-130314-ladsgroup.json [13:06:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P36286 and previous config saved to /var/cache/conftool/dbconfig/20221025-130628-ladsgroup.json [13:07:02] hi [13:07:12] I have an addition to the backport window [13:09:25] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for aswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849092 (https://phabricator.wikimedia.org/T304549) [13:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36287 and previous config saved to 
/var/cache/conftool/dbconfig/20221025-130931-ladsgroup.json [13:11:04] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10fgiunchedi) I've been scratching my head a little on this because the alert seemingly *has* fired: {F35624931} {F35624934} Yet I can't find any notification ATM [13:11:06] (03PS1) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:11:11] (03PS1) 10Muehlenhoff: Fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/849094 [13:11:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [13:11:18] koi: I don't have enough time to backport your change, I'm afraid. Are any of the other deployers around? [13:12:11] don't know .. [13:12:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1023.eqiad.wmnet with OS bullseye [13:13:43] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]] [13:14:11] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:14:35] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [13:15:16] (03PS1) 10Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [13:15:49] o/ [13:16:02] (03CR) 10Jbond: [C: 03+1] Add profile::contacts::role_contacts for turnilo/staging [puppet] - 10https://gerrit.wikimedia.org/r/849059 (owner: 10Muehlenhoff) [13:16:07] * Lucas_WMDE looks 
[13:16:15] (CR) Jbond: [C: +1] Set profile::contacts::role_contacts for gitlab runners [puppet] - https://gerrit.wikimedia.org/r/849063 (owner: Muehlenhoff) [13:16:24] oh, still the big logos change :sweat_sm [13:16:33] * Lucas_WMDE needs to learn how emojis work in irccloud [13:16:40] 😅 is what I meant [13:16:49] (CR) Jbond: [C: +1] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff) [13:16:54] (CR) Btullis: [C: +1] Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059 (owner: Muehlenhoff) [13:17:06] oh wait, no it’s a different one [13:17:16] (CR) Btullis: [C: +1] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff) [13:17:19] (CR) Volans: "couple of questions inline" [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond) [13:17:22] (CR) CI reject: [V: -1] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [13:17:28] (CR) Muehlenhoff: [C: +2] Add profile::contacts::role_contacts for turnilo/staging [puppet] - https://gerrit.wikimedia.org/r/849059 (owner: Muehlenhoff) [13:17:31] (CR) Jbond: [C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) (owner: Arturo Borrero Gonzalez) [13:17:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:17:42] (CR) Muehlenhoff: [C: +2] Set profile::contacts::role_contacts for datahubsearch [puppet] - https://gerrit.wikimedia.org/r/849061 (owner: Muehlenhoff) [13:17:49] yeah the one yesterday was processed [13:17:54] (CR) Vgutierrez: [C: +1] "LGTM! Thanks Cwhite!"
[puppet] - https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: Cwhite) [13:17:59] nice [13:18:05] okay, looking [13:18:14] (CR) Majavah: wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [13:18:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:18:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:18:56] (CR) MVernon: [C: +1] Use generic 'Check systemd state' alert to catch timer failures (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi) [13:19:13] (CR) Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [13:19:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:19:29] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:849092|GrowthExperiments: Enable link recommendation for aswiki (T304549)]] (duration: 05m 45s) [13:20:09] (PS2) Filippo Giunchedi: Use generic 'Check systemd state' alert to catch timer failures [puppet] - https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) [13:20:24] Lucas_WMDE: done with backporting my patch.
[13:20:29] (PS2) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [13:20:33] ok, I’m reviewing koi’s patch [13:20:37] (CR) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond) [13:21:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [13:21:17] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:21:29] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [13:21:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321312)', diff saved to https://phabricator.wikimedia.org/P36288 and previous config saved to /var/cache/conftool/dbconfig/20221025-132135-ladsgroup.json [13:21:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:21:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:22:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36289 and previous config saved to /var/cache/conftool/dbconfig/20221025-132201-ladsgroup.json [13:22:09] (PS3) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [13:22:50] (CR) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php (2 comments) [mediawiki-config] -
https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:22:54] koi: just a tiny comment [13:22:57] looks good otherwise [13:23:30] (CR) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:23:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:24:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:24:22] (PS2) Stang: Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) [13:24:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P36290 and previous config saved to /var/cache/conftool/dbconfig/20221025-132438-ladsgroup.json [13:24:50] (CR) Stang: Move wmgSiteLogoVariants to logos.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:25:03] (PS3) Lucas Werkmeister (WMDE): Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:25:56] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:26:49] (Merged) jenkins-bot: Move wmgSiteLogoVariants to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) (owner: Stang) [13:27:12] !log lucaswerkmeister-wmde@deploy1002 Started scap:
Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]] [13:27:14] (03PS2) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:27:19] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [13:27:19] T321519: Define wmgSiteLogoVariants in logos/config.yaml - https://phabricator.wikimedia.org/T321519 [13:27:36] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:27:43] koi: ^ [13:27:49] I guess there’s nothing to test – just check nothing’s broken? [13:28:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:28:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:28:30] (03CR) 10Elukey: coredns: support up to upstream version 1.8.7 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [13:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36291 and previous config saved to /var/cache/conftool/dbconfig/20221025-132839-ladsgroup.json [13:28:54] Lucas_WMDE: I tested all five involved projects, and there's no changes for the logo variants, so LGTM [13:28:59] \o/ thanks [13:29:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] 
START helmfile.d/services/mwdebug: apply [13:31:01] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [13:31:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:33] (03PS3) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:31:42] (03CR) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [13:32:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:32:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:33:00] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:848552|Move wmgSiteLogoVariants to logos.php (T308620 T321519)]] (duration: 05m 47s) [13:33:06] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [13:33:07] T321519: Define wmgSiteLogoVariants in logos/config.yaml - https://phabricator.wikimedia.org/T321519 [13:33:21] anything else to deploy? 
[13:33:52] (03CR) 10Marostegui: [C: 03+1] Add add_el_to_domain_index_T318605.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849065 (https://phabricator.wikimedia.org/T318605) (owner: 10Ladsgroup) [13:33:54] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [13:34:28] !log UTC afternoon backport+config window done [13:34:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to plain [13:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:00] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [13:35:02] (03CR) 10Muehlenhoff: [C: 03+2] Fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/849094 (owner: 10Muehlenhoff) [13:35:05] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@5575605]: Update restbase to c1d391c7 [13:35:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to plain [13:36:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance [13:38:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance [13:38:31] (03CR) 
10Filippo Giunchedi: [C: 03+2] Use generic 'Check systemd state' alert to catch timer failures [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [13:39:43] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [13:39:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P36292 and previous config saved to /var/cache/conftool/dbconfig/20221025-133944-ladsgroup.json [13:40:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:11] PROBLEM - MariaDB Replica IO: s5 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:17] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:21] PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect 
to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:40] Amir1: ^ [13:42:47] That comes from db1154 [13:42:56] I'm restarting it [13:43:04] I think I haven't downtimed clouddbs [13:43:06] sorry [13:43:22] it should be up in one or two minutes [13:43:39] marostegui: will it page? [13:43:46] I think it pages WMCS [13:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36293 and previous config saved to /var/cache/conftool/dbconfig/20221025-134345-ladsgroup.json [13:43:49] But I am not fully sure [13:44:17] ok [13:45:07] (03CR) 10Herron: dispatch: introduce profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [13:46:17] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:47:07] I have a feeling it's not coming back online, wait for a couple of minutes more and see [13:47:36] I am connected to the console, let me see [13:50:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.426 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:51:50] Amir1: We probably need a task with ops-eqiad [13:52:03] I am trying to send hard resets and poweroff/on but it doesn't seem to be doing anything [13:52:13] okay [13:52:26] let me downtime clouddbs [13:53:20] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@5575605]: Update restbase to c1d391c7 (duration: 18m 14s) [13:53:25] 10SRE, 
10Patch-For-Review, 10User-jbond: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10ayounsi) Some notes/thoughts from a chat with @jbond: * Based on P36282 and except `Data Engineering,Machine Learning` all servers have 1 clear team owner * `role_contacts` has been ext... [13:53:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1020-1021].eqiad.wmnet with reason: db1154 having hw issues [13:53:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1020-1021].eqiad.wmnet with reason: db1154 having hw issues [13:54:08] (03CR) 10JMeybohm: [C: 03+1] coredns: support up to upstream version 1.8.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/849015 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [13:54:30] (03PS1) 10Andrew Bogott: Neutron, glance, cinder, keystone: Move api firewall rules into haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/849098 (https://phabricator.wikimedia.org/T319312) [13:54:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321312)', diff saved to https://phabricator.wikimedia.org/P36294 and previous config saved to /var/cache/conftool/dbconfig/20221025-135451-ladsgroup.json [13:54:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [13:55:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [13:55:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36295 and previous config saved to /var/cache/conftool/dbconfig/20221025-135515-ladsgroup.json [13:55:30] marostegui: are you creating the ticket or should I? 
[13:55:54] Amir1: Please do it, I am still trying if I can get it back [13:56:01] sure [13:56:13] (03CR) 10Herron: [C: 03+1] "Seems most of the unresolved comments have been addressed by now, maybe one or two minor things remaining, LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [13:56:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:57:34] 10ops-eqiad: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Ladsgroup) [13:57:48] (03CR) 10Giuseppe Lavagetto: Add cookbook to restart pybal (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [13:58:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P36296 and previous config saved to /var/cache/conftool/dbconfig/20221025-135852-ladsgroup.json [13:59:07] !log test bouncing VC port on asw2-d-eqiad [13:59:08] (03PS1) 10Majavah: openstack: wmf_sink: set accept header for enc deletion calls [puppet] - 10https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) [13:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:13] (03PS7) 10Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 [13:59:34] 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Marostegui) p:05Triage→03High I have tried to power it off and then back on from the console but I was getting weird outputs like: ` racadm>>serveraction powerstatus Server power status: OFF racadm>>... 
[13:59:47] (03CR) 10CI reject: [V: 04-1] openstack: wmf_sink: set accept header for enc deletion calls [puppet] - 10https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [13:59:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:15] (03PS2) 10Majavah: openstack: wmf_sink: set accept header for enc deletion calls [puppet] - 10https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) [14:00:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 8.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36297 and previous config saved to /var/cache/conftool/dbconfig/20221025-140131-ladsgroup.json [14:02:01] (03CR) 10Andrew Bogott: [C: 03+2] Neutron, glance, cinder, keystone: Move api firewall rules into haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/849098 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:03:16] (03CR) 10Andrew Bogott: [C: 03+2] openstack: wmf_sink: set accept header for enc deletion calls [puppet] - 10https://gerrit.wikimedia.org/r/849099 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [14:04:03] (03Abandoned) 10Btullis: Add cumin aliases for dse-k8s in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/843932 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [14:04:48] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a simple mechanism for creating postgresql users and databases [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:05:22] 10SRE, 10Infrastructure-Foundations, 10netops, 
10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) As data point I tried: `asw2-d-eqiad# run request virtual-chassis vc-port set pic-slot 0 member 2 port 49` th... [14:09:55] (03CR) 10Volans: [C: 04-1] "The logic looks good to me, just a couple of errors inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [14:10:18] (03PS4) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [14:12:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849050 (owner: 10Arturo Borrero Gonzalez) [14:13:43] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [14:13:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321312)', diff saved to https://phabricator.wikimedia.org/P36298 and previous config saved to /var/cache/conftool/dbconfig/20221025-141358-ladsgroup.json [14:14:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:14:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:14:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:14:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:14:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36299 and previous config saved to /var/cache/conftool/dbconfig/20221025-141440-ladsgroup.json [14:16:21] 
(03PS1) 10Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) [14:16:25] (03PS1) 10Ssingh: Depool ulsfo for cp hosts hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/849105 (https://phabricator.wikimedia.org/T317247) [14:16:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P36300 and previous config saved to /var/cache/conftool/dbconfig/20221025-141638-ladsgroup.json [14:16:55] (03CR) 10CI reject: [V: 04-1] haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:17:22] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2321.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:17:31] (03PS5) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [14:18:25] !log Restarting CI Jenkins [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:04] (03PS10) 10Filippo Giunchedi: dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [14:20:06] (03PS4) 10Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) [14:20:18] (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:20:27] (03PS2) 10Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) [14:21:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36301 and previous config saved to /var/cache/conftool/dbconfig/20221025-142106-ladsgroup.json [14:21:16] (03CR) 10CI reject: [V: 04-1] haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:21:48] (03CR) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [14:21:51] (03PS3) 10Andrew Bogott: haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) [14:23:14] PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2668.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:23:29] (03CR) 10Andrew Bogott: [C: 03+2] haproxy: correct srange syntax for internal apis [puppet] - 10https://gerrit.wikimedia.org/r/849104 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:24:32] (03PS2) 10Clément Goubert: aptrepo: add component thirdparty/otelcol-contrib [puppet] - 10https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) [14:24:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb1016.eqiad.wmnet with reason: db1154 having hw issues [14:25:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 
0:00:00 on clouddb1016.eqiad.wmnet with reason: db1154 having hw issues [14:30:03] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:31:03] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [14:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P36303 and previous config saved to /var/cache/conftool/dbconfig/20221025-143144-ladsgroup.json [14:34:01] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37737/console" [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [14:35:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) Thanks @ayounsi, was worth a shot :) I'm thinking we probably proceed as follows: 1. Perform master switch... 
[14:35:29] (03CR) 10FNegri: [V: 03+1] Add thirdparty/tekton repo to WMCS bastions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [14:35:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] realm.pp: introduce $::wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849050 (owner: 10Arturo Borrero Gonzalez) [14:35:48] (03PS2) 10Arturo Borrero Gonzalez: realm.pp: introduce $::wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849050 [14:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36304 and previous config saved to /var/cache/conftool/dbconfig/20221025-143613-ladsgroup.json [14:37:19] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) Just a note that I should have added previously that Juniper wouldn't provide support due to JunOS 14.1 being... [14:37:43] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:09] (03PS6) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [14:41:34] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [14:42:49] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10phaultfinder) [14:42:59] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:44:14] (03CR) 10Ottomata: "Awesome! two more nits, but +1 otherwise!" 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [14:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321312)', diff saved to https://phabricator.wikimedia.org/P36305 and previous config saved to /var/cache/conftool/dbconfig/20221025-144651-ladsgroup.json [14:46:56] (03CR) 10Muehlenhoff: [C: 03+2] xmldumps: Enable profile::auto_restarts::service for nginx [puppet] - 10https://gerrit.wikimedia.org/r/832259 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:49:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P36306 and previous config saved to /var/cache/conftool/dbconfig/20221025-145120-ladsgroup.json [14:51:47] (03CR) 10Volans: "followup from IRC chat" [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [14:53:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.938 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:39] (03CR) 10Ssingh: [C: 03+2] Depool ulsfo for cp hosts hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/849105 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [14:54:14] !log running authdns-update for depooling ulsfo: Gerrit 849105 [14:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:55:56] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849108 [15:05:42] (03PS1) 10Vgutierrez: acme_chief: Test adding wikifunctions.org in acmechief-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227) [15:06:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36307 and previous config saved to /var/cache/conftool/dbconfig/20221025-150626-ladsgroup.json [15:06:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:06:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36308 and previous config saved to /var/cache/conftool/dbconfig/20221025-150653-ladsgroup.json [15:06:54] (03CR) 10Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:07:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37741/console" [puppet] - 10https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227) (owner: 10Vgutierrez) [15:10:05] (03CR) 10Clément Goubert: [C: 03+2] aptrepo: add component thirdparty/otelcol-contrib [puppet] - 10https://gerrit.wikimedia.org/r/849089 (https://phabricator.wikimedia.org/T320551) (owner: 10Clément Goubert) [15:11:15] !log installing isc-dhcp security updates [15:11:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36309 and previous config saved to /var/cache/conftool/dbconfig/20221025-151308-ladsgroup.json [15:13:19] (03CR) 10Vgutierrez: [C: 03+1] Clean up outdated commentary on requestctl [puppet] - 10https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [15:17:19] (03CR) 10Cwhite: [C: 03+2] logstash: add sanitize filter [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [15:21:38] (03CR) 10Elukey: [C: 03+1] "Looks good to me, it should work, even if the usage of should be limited as much as possible for perf reasons IIRC. In this case is p" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849108 (owner: 10Giuseppe Lavagetto) [15:22:06] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [15:25:33] !log added component thirdparty/otelcol-contrib to apt repository [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36310 and previous config saved to /var/cache/conftool/dbconfig/20221025-152815-ladsgroup.json [15:28:37] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10fgiunchedi) >>! In T321547#8341438, @Vgutierrez wrote: > nice catch @fgiunchedi. Actually I've assumed that it wasn't fired cause we didn't get the recovery on the traffic IRC channel when T321545 got fixed... 
[15:29:41] (03PS2) 10Giuseppe Lavagetto: httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849108 [15:30:17] !log added package otelcol-contrib_0.62.1_linux_amd64.deb to component thirdparty/otelcol-contrib for bullseye-wikimedia and buster-wikimedia - T320551 [15:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:23] T320551: Package OpenTelemetry Collector as a .deb - https://phabricator.wikimedia.org/T320551 [15:33:26] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) >>! In T308677#8339287, @jbond wrote: >>>! In T308677#8338658,... [15:36:04] (03PS1) 10Ayounsi: Add Peering News to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/849114 [15:37:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8399 [15:38:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8399 [15:39:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add thirdparty/tekton repo to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [15:40:05] (03CR) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [15:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P36311 and previous config saved to /var/cache/conftool/dbconfig/20221025-154321-ladsgroup.json [15:43:35] (03PS7) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [15:44:43] (03PS1) 10Hnowlan: kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 [15:47:22] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [15:48:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) I don't remember the impact of a switchover (eg. if it's none or tiny). So to be done carefully. At least the... [15:49:45] (03CR) 10FNegri: [V: 03+1 C: 03+2] Add thirdparty/tekton repo to WMCS bastions [puppet] - 10https://gerrit.wikimedia.org/r/849047 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [15:50:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: refresh licence [puppet] - 10https://gerrit.wikimedia.org/r/849053 (https://phabricator.wikimedia.org/T308013) (owner: 10Arturo Borrero Gonzalez) [15:54:44] (03CR) 10Ottomata: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [15:56:05] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:57:09] (03PS8) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [15:57:50] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: bow to the will of the evil overlord, httpd [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/849108 (owner: 10Giuseppe Lavagetto) [15:58:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321312)', diff saved to https://phabricator.wikimedia.org/P36312 and previous config saved to /var/cache/conftool/dbconfig/20221025-155828-ladsgroup.json [15:58:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:58:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:58:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321312)', diff saved to https://phabricator.wikimedia.org/P36313 and previous config saved to /var/cache/conftool/dbconfig/20221025-155855-ladsgroup.json [15:59:28] (03PS1) 10Vgutierrez: hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - 10https://gerrit.wikimedia.org/r/849121 [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:09] (03PS2) 10Vgutierrez: hieradata pcc: Update deployment-puppetmaster04 public key [puppet] - 10https://gerrit.wikimedia.org/r/849121 [16:00:34] jbond: ^^ I don't know if I'm missing something there [16:00:42] but it feels pretty weird to me [16:00:45] (03PS1) 10Btullis: Open up the postrges service to the analytics vlans [puppet] - 10https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440) [16:01:06] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [16:04:19] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [16:04:32] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [16:04:46] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [16:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321312)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221025-160504-ladsgroup.json [16:06:11] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs2002.codfw.wmnet [16:06:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs2002.codfw.wmnet [16:06:37] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37744/console" [puppet] - 10https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [16:08:40] (03PS2) 10Btullis: Open up the postrges service to the analytics vlans [puppet] - 10https://gerrit.wikimedia.org/r/849122 (https://phabricator.wikimedia.org/T319440) [16:09:15] (03CR) 10Vgutierrez: "current logic looks good to me as well" [cookbooks] - 10https://gerrit.wikimedia.org/r/848949 (owner: 10Giuseppe Lavagetto) [16:12:06] (03PS2) 10Cwhite: hiera: map logstash.wm.o to kibana7.codfw [puppet] - 
10https://gerrit.wikimedia.org/r/828109 (https://phabricator.wikimedia.org/T304440) [16:14:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:39] (03CR) 10Vgutierrez: [C: 03+2] "```" [puppet] - 10https://gerrit.wikimedia.org/r/849121 (owner: 10Vgutierrez) [16:14:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:55] (03CR) 10BBlack: [C: 03+2] Add wikifunctions.org to exim domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [16:16:54] (03PS2) 10Vgutierrez: acme_chief: Test adding wikifunctions.org in acmechief-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/849111 (https://phabricator.wikimedia.org/T313227) [16:18:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.810 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36315 and previous config saved to /var/cache/conftool/dbconfig/20221025-162015-ladsgroup.json [16:33:59] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P36316 and previous config saved to /var/cache/conftool/dbconfig/20221025-163522-ladsgroup.json [16:36:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: 
Maintenance [16:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36320 and previous config saved to /var/cache/conftool/dbconfig/20221025-165831-ladsgroup.json [16:59:39] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [17:02:07] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) sub-ports are ready for cr2-eqiad ` papaul@re0.cr2-eqiad# run show interfaces terse | match xe-1/0/* xe-1/0/1:0 down down xe-1/0/1:1... [17:02:14] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [17:02:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:02:33] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [17:02:45] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:47] (03PS1) 10Andrew Bogott: OpenStack trove: expose API to the public internet [puppet] - 10https://gerrit.wikimedia.org/r/849127 (https://phabricator.wikimedia.org/T319312) [17:03:09] (03PS2) 10Hnowlan: api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) [17:05:01] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:05:11] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:06:35] (03CR) 10CI reject: [V: 04-1] api-gateway: create fine-grained liftwing API definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/844452 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [17:08:33] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: 
set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [17:09:47] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:09:59] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:39] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack trove: expose API to the public internet [puppet] - 10https://gerrit.wikimedia.org/r/849127 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [17:10:41] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [17:12:01] (03PS1) 10Andrew Bogott: haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - 10https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312) [17:12:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:12:53] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:13:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P36321 and previous config saved to /var/cache/conftool/dbconfig/20221025-171337-ladsgroup.json [17:13:44] 10SRE, 10Traffic, 10observability: rate() requires at least >=2m for HAProxy metrics in upload@(eqiad|codfw) - https://phabricator.wikimedia.org/T321553 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks! 
[17:13:56] (03PS2) 10Andrew Bogott: haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - 10https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312) [17:14:33] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:14:36] (03CR) 10Andrew Bogott: [C: 03+2] haproxy/ferm: rename internal ferm rules 'internal' rather than 'public' [puppet] - 10https://gerrit.wikimedia.org/r/849128 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [17:14:39] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:43] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:17:45] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:17:47] (03PS1) 10FNegri: Don't use slash in apt:repo name [puppet] - 10https://gerrit.wikimedia.org/r/849129 [17:17:57] (03PS10) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [17:17:59] (03PS1) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [17:18:25] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849129 (owner: 10FNegri) [17:20:03] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [17:20:14] (03CR) 10FNegri: [C: 03+2] Don't use slash in apt:repo name [puppet] - 10https://gerrit.wikimedia.org/r/849129 (owner: 10FNegri) 
[17:21:35] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [17:21:50] (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [17:23:21] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:57] !log mforns@deploy1002 Started deploy [analytics/refinery@d3b7785] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d3b7785] [17:24:18] (03PS1) 10Herron: slo_dashboard: move to one SLO/SLI per dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) [17:25:00] (03PS2) 10Herron: slo_dashboards: move to one SLO/SLI per dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) [17:25:01] !log mforns@deploy1002 Finished deploy [analytics/refinery@d3b7785] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d3b7785] (duration: 01m 04s) [17:26:32] (03CR) 10Herron: slo_dashboards: move slo definitions and defaults to files (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [17:27:42] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:28:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321312)', diff saved to https://phabricator.wikimedia.org/P36322 and previous config saved to /var/cache/conftool/dbconfig/20221025-172844-ladsgroup.json [17:28:49] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:29:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:29:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36323 and previous config saved to /var/cache/conftool/dbconfig/20221025-172909-ladsgroup.json [17:30:46] (03CR) 10Herron: "note: this patch should be a noop in terms of grafana dashboard output" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [17:31:47] (03CR) 10David Caro: [C: 03+2] p::toolforge:harbor::prepare: upgrade harbor to v2.5.4 [puppet] - 10https://gerrit.wikimedia.org/r/848602 (https://phabricator.wikimedia.org/T316530) (owner: 10Raymond Ndibe) [17:38:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36324 and previous config saved to /var/cache/conftool/dbconfig/20221025-173817-ladsgroup.json [17:39:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:40:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321312)', diff saved to https://phabricator.wikimedia.org/P36325 and previous config saved to /var/cache/conftool/dbconfig/20221025-174013-ladsgroup.json [17:42:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 
(T321312)', diff saved to https://phabricator.wikimedia.org/P36326 and previous config saved to /var/cache/conftool/dbconfig/20221025-174639-ladsgroup.json [17:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36327 and previous config saved to /var/cache/conftool/dbconfig/20221025-175323-ladsgroup.json [17:54:52] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:55:30] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:42] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:57:40] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:57:40] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4025.ulsfo.wmnet [17:57:44] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4025.ulsfo.wmnet` - cp4025.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... 
[17:58:17] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:58:18] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4023.ulsfo.wmnet [17:58:22] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4023.ulsfo.wmnet` - cp4023.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... [17:58:22] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:59:38] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [17:59:44] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:23] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:34] (03PS8) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [18:01:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36328 and previous config saved to /var/cache/conftool/dbconfig/20221025-180145-ladsgroup.json [18:02:11] (03CR) 10Muehlenhoff: "Ack, thanks for the various reviews. 
I'm going to merge and then we can test this (and fine-tune if needed) once the OpenSearch update is " [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [18:03:50] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:04:25] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4039 [18:04:40] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4039 [18:04:44] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4041 [18:04:58] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4041 [18:05:10] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4043 [18:05:25] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4043 [18:06:42] (03CR) 10Eevans: [C: 03+1] kask: make TLS configuration a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/849117 (owner: 10Hnowlan) [18:07:14] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4039.mgmt.ulsfo.wmnet with reboot policy FORCED [18:07:44] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED [18:08:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P36329 and previous config saved to /var/cache/conftool/dbconfig/20221025-180830-ladsgroup.json [18:09:08] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4043.mgmt.ulsfo.wmnet with reboot policy FORCED [18:11:26] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:13:27] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified 
decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [18:13:40] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [18:13:44] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) [18:16:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36330 and previous config saved to /var/cache/conftool/dbconfig/20221025-181652-ladsgroup.json [18:18:36] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to restart/reboot logstash collector nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [18:19:15] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [18:19:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) failure of provision script against cp4039 ` [1/30, retrying in 30.00s] Polling task: JID_667217070909 not completed yet: status=OK, state=Running, complete... 
[18:19:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4041.mgmt.ulsfo.wmnet with reboot policy FORCED [18:20:04] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4039.mgmt.ulsfo.wmnet with reboot policy FORCED [18:21:54] (03PS2) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [18:22:00] (03PS11) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [18:22:59] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4043.mgmt.ulsfo.wmnet with reboot policy FORCED [18:23:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T321312)', diff saved to https://phabricator.wikimedia.org/P36331 and previous config saved to /var/cache/conftool/dbconfig/20221025-182336-ladsgroup.json [18:23:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:23:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:24:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36332 and previous config saved to /var/cache/conftool/dbconfig/20221025-182402-ladsgroup.json [18:24:16] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:26:28] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [18:29:39] (03PS12) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 
10https://gerrit.wikimedia.org/r/849093 [18:29:46] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4022.ulsfo.wmnet [18:29:56] (03PS13) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [18:29:58] (03PS1) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [18:30:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36333 and previous config saved to /var/cache/conftool/dbconfig/20221025-183008-ladsgroup.json [18:30:44] (03CR) 10Jbond: "ready for review, examples in the follow up patches" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [18:31:21] (03CR) 10Nskaggs: [C: 03+1] "Looks good. Modifying the array to add more projects should be simpler now." [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [18:31:50] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4024.ulsfo.wmnet [18:31:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321312)', diff saved to https://phabricator.wikimedia.org/P36334 and previous config saved to /var/cache/conftool/dbconfig/20221025-183158-ladsgroup.json [18:32:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [18:32:16] (03CR) 10Nskaggs: [C: 03+1] dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [18:32:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [18:32:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 
(T321312)', diff saved to https://phabricator.wikimedia.org/P36335 and previous config saved to /var/cache/conftool/dbconfig/20221025-183224-ladsgroup.json [18:33:32] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [18:33:41] (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [18:34:16] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:34:16] (03PS3) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [18:34:27] (03PS14) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [18:34:33] (03PS2) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [18:34:52] PROBLEM - Confd vcl based reload on cp4033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:35:40] PROBLEM - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:35:50] PROBLEM - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:36:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) [18:36:12] PROBLEM - Confd vcl based reload on cp4034 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:36:16] PROBLEM - Confd vcl based reload on cp4026 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:37:00] PROBLEM - PyBal backends health check on lvs4006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp4024.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4024.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:37:06] (ConfdResourceFailed) firing: (12) confd resource _srv_config-master_pybal_codfw_upload-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:37:18] PROBLEM - PyBal backends health check on lvs4007 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp4024.ulsfo.wmnet are marked down but pooled: uploadlb_443: Servers cp4024.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:37:30] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:37:40] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:41] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4022.ulsfo.wmnet [18:37:44] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4022.ulsfo.wmnet` - cp4022.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... 
[18:38:14] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [18:38:20] (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [18:38:44] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:38:44] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4024.ulsfo.wmnet [18:38:47] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4024.ulsfo.wmnet` - cp4024.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... [18:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321312)', diff saved to https://phabricator.wikimedia.org/P36336 and previous config saved to /var/cache/conftool/dbconfig/20221025-184006-ladsgroup.json [18:41:10] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4026.ulsfo.wmnet [18:41:59] (03PS4) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [18:42:06] (ConfdResourceFailed) firing: (24) confd resource _srv_config-master_pybal_codfw_upload-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:44:43] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4028.ulsfo.wmnet [18:45:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P36337 and previous config saved to 
/var/cache/conftool/dbconfig/20221025-184514-ladsgroup.json [18:45:18] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4030.ulsfo.wmnet [18:45:49] (03PS15) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [18:45:51] (03PS3) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [18:46:11] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:46:30] PROBLEM - Confd vcl based reload on cp4047 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:47:32] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:47:59] (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [18:49:05] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:06] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4026.ulsfo.wmnet [18:49:09] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4026.ulsfo.wmnet` - cp4026.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... 
[18:49:42] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:49:52] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:49:52] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:49:52] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4032.ulsfo.wmnet [18:50:20] (03CR) 10Bking: [C: 03+1] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse) [18:50:23] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:50:26] (03CR) 10Ryan Kemper: [C: 03+1] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse) [18:50:26] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:50:32] (03CR) 10Bking: [C: 03+2] elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 (owner: 10DCausse) [18:51:27] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [18:51:45] (JobUnavailable) firing: (3) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:52:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:40] PROBLEM - Host elastic2052 is DOWN: PING CRITICAL - Packet loss = 100% [18:52:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:57] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:52:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4030.ulsfo.wmnet 
[18:53:02] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4030.ulsfo.wmnet` - cp4030.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... [18:53:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:53:36] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4028.ulsfo.wmnet [18:53:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:44] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4028.ulsfo.wmnet` - cp4028.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... 
[18:53:48] RECOVERY - Host elastic2052 is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms [18:54:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36338 and previous config saved to /var/cache/conftool/dbconfig/20221025-185513-ladsgroup.json [18:55:56] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:55:56] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:56:18] PROBLEM - PyBal backends health check on lvs4005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: textlb_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: testlb6_443: Servers cp4032.ulsfo.wmnet are marked down but pooled: textlb6_443: Servers cp4032.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:27] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:56:36] PROBLEM - Confd vcl based reload on cp4036 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:56:45] (JobUnavailable) firing: (4) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:58] PROBLEM - Confd vcl based reload on cp4035 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:57:40] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:57:41] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4032.ulsfo.wmnet [18:57:44] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4032.ulsfo.wmnet` - cp4032.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... [18:59:17] !log bking@elastic2070 'restarting elastic7 services to apply 838141' [18:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P36339 and previous config saved to /var/cache/conftool/dbconfig/20221025-190021-ladsgroup.json [19:01:45] (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:08:09] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:10:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36340 and previous config saved to /var/cache/conftool/dbconfig/20221025-191020-ladsgroup.json [19:10:35] RECOVERY - 
mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:12:06] (ConfdResourceFailed) firing: (46) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:12:41] (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/828109 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [19:13:22] (03PS2) 10Ssingh: cp4023: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244) [19:14:49] (03CR) 10Ssingh: [C: 03+2] cp4023: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [19:15:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321312)', diff saved to https://phabricator.wikimedia.org/P36341 and previous config saved to /var/cache/conftool/dbconfig/20221025-191527-ladsgroup.json [19:15:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [19:15:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [19:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36342 and previous config saved to /var/cache/conftool/dbconfig/20221025-191552-ladsgroup.json [19:21:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.684 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36343 and previous config saved to /var/cache/conftool/dbconfig/20221025-192203-ladsgroup.json [19:23:53] (03PS1) 10Ssingh: cp4025: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849140 (https://phabricator.wikimedia.org/T317244) [19:25:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321312)', diff saved to https://phabricator.wikimedia.org/P36344 and previous config saved to /var/cache/conftool/dbconfig/20221025-192526-ladsgroup.json [19:25:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:25:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:25:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:25:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36345 and previous config saved to /var/cache/conftool/dbconfig/20221025-192556-ladsgroup.json [19:26:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:43] PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:02] (03CR) 10Ssingh: [C: 03+2] cp4025: decommission host 
as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849140 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [19:27:55] PROBLEM - Host wcqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.524 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:28:17] RECOVERY - Host wcqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:28:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 2.514 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:29:25] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10RKemper) >>! In T320482#8309495, @Papaul wrote: > @bking this host is out of warranty. If it is a critical host you will have to let us know and request to purchase a disk. Ano... 
[19:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36347 and previous config saved to /var/cache/conftool/dbconfig/20221025-193331-ladsgroup.json [19:34:04] (03PS1) 10Ssingh: cp402[2468]: decommission hosts as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244) [19:36:49] (03PS1) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [19:37:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36348 and previous config saved to /var/cache/conftool/dbconfig/20221025-193709-ladsgroup.json [19:39:58] !log logstash opensearch 2.2.0 codfw transition complete T304440 [19:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:04] T304440: Test and upgrade OpenSearch to 2.2.0 - https://phabricator.wikimedia.org/T304440 [19:40:33] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Pcoombe) [19:40:49] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Pcoombe) [19:40:51] (03PS9) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [19:43:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:44:29] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [19:47:55] (03PS1) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [19:48:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36349 and previous config saved to /var/cache/conftool/dbconfig/20221025-194838-ladsgroup.json [19:48:48] (03PS2) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [19:49:45] (03PS3) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [19:50:35] (03PS4) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [19:50:54] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:52:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P36350 and previous config saved to /var/cache/conftool/dbconfig/20221025-195216-ladsgroup.json [19:53:40] (03PS2) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [19:53:52] (03CR) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [19:53:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:54:05] (03PS3) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [19:54:07] RECOVERY - Check systemd state on kubernetes2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:48] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:54:53] (03CR) 10CI reject: [V: 04-1] Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [19:55:02] (03PS1) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) [19:56:45] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:58] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4038 [19:57:14] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4038 [19:57:15] (03PS2) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) [19:58:17] (03PS1) 10Andrew Bogott: Revert "wmcs: format and refactor 
maintain-dbusers.py" [puppet] - 10https://gerrit.wikimedia.org/r/849150 [19:58:27] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:58:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:00:04] RoanKattouw, Urbanecm, and cjming: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221025T2000). [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] present [20:00:30] hi ! i can deploy [20:00:49] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [20:01:03] (03CR) 10CI reject: [V: 04-1] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [20:01:57] Jdlrobson: is that one test failing a problem? [20:02:37] looking.. [20:03:03] some linting issues.. fixing.. 
[20:03:10] mostly linting errors [20:03:24] (03PS3) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) [20:03:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36351 and previous config saved to /var/cache/conftool/dbconfig/20221025-200344-ladsgroup.json [20:03:47] (03PS4) 10Bking: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) [20:05:25] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs: format and refactor maintain-dbusers.py" [puppet] - 10https://gerrit.wikimedia.org/r/849150 (owner: 10Andrew Bogott) [20:05:39] (03PS4) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [20:06:26] cmjohnson1: ok that should do it [20:07:04] (03CR) 10CI reject: [V: 04-1] Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:07:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321312)', diff saved to https://phabricator.wikimedia.org/P36352 and previous config saved to /var/cache/conftool/dbconfig/20221025-200723-ladsgroup.json [20:07:23] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:07:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [20:07:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [20:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36353 and previous 
config saved to /var/cache/conftool/dbconfig/20221025-200746-ladsgroup.json [20:08:12] Jdlrobson: almost - wg prefix? [20:08:43] (03CR) 10Ryan Kemper: [C: 03+1] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [20:08:48] * urbanecm waves and sees B&C is being taken care of, thanks cjming [20:08:54] np! [20:09:26] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:44] :D [20:10:06] (03CR) 10Bking: [C: 03+2] query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [20:10:13] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4040 [20:10:25] (03PS5) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [20:10:27] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4040 [20:10:31] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4042 [20:10:46] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4042 [20:10:50] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4044 [20:11:04] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4044 [20:11:06] (03PS1) 10Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/849166 (https://phabricator.wikimedia.org/T304040) [20:11:07] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4046 [20:11:22] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4046 [20:11:26] 
!log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4048 [20:11:41] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4048 [20:13:34] (03Merged) 10jenkins-bot: query_service: sanity-check file size on data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/849145 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [20:13:35] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [20:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36354 and previous config saved to /var/cache/conftool/dbconfig/20221025-201343-ladsgroup.json [20:13:58] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4034.ulsfo.wmnet [20:14:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:14:27] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cp4034.ulsfo.wmnet [20:14:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:14:48] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4034.ulsfo.wmnet [20:14:49] (03PS6) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) [20:15:13] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4036.ulsfo.wmnet [20:15:55] (03PS7) 10Jdlrobson: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 
(https://phabricator.wikimedia.org/T319223) [20:15:58] sorry cjming ^ [20:16:04] had to make some updates to one of the assets [20:16:14] oh - whoops [20:17:31] hmm -- quick Q urbanecm: if i've already run scap backport but the patch got updated during it, will it just fail and i can rerun? [20:17:43] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10XenoRyet) [20:18:01] cjming: if the patch did not get merge yet, it will hang forever [20:18:08] it should work if you +2 it manually [20:18:11] (the patch, i mean) [20:18:17] cool - thanks [20:18:42] (03CR) 10Dzahn: [C: 03+2] miscweb: add rsyslog::input::files to send apache logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [20:18:51] Jdlrobson: gtg? [20:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321312)', diff saved to https://phabricator.wikimedia.org/P36355 and previous config saved to /var/cache/conftool/dbconfig/20221025-201852-ladsgroup.json [20:18:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:19:03] gtg! 
[20:19:10] (03CR) 10Clare Ming: [C: 03+2] Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:19:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:19:13] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: mnt-nfs-dumps\x2dlabstore1006.wikimedia.org.mount,mnt-nfs-dumps\x2dlabstore1007.wikimedia.org.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36356 and previous config saved to /var/cache/conftool/dbconfig/20221025-201918-ladsgroup.json [20:20:07] (03Merged) 10jenkins-bot: Update remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849142 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:20:32] !log cjming@deploy1002 Started scap: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]] [20:20:38] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:20:56] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:21:09] Jdlrobson: wanna check a debug server? 
[20:21:37] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:21:38] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:21:45] (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:21:50] cjming: yes please [20:21:53] PROBLEM - Confd vcl based reload on cp4037 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [20:22:11] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:22:21] LGTM cjming ! [20:22:35] great - going live [20:23:28] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:23:30] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4036.ulsfo.wmnet [20:23:38] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4036.ulsfo.wmnet` - cp4036.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... 
[20:24:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/849166 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:24:05] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:24:05] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4034.ulsfo.wmnet [20:24:08] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4034.ulsfo.wmnet` - cp4034.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Fo... [20:24:57] Thanks a lot @cjming ! [20:24:59] looking great [20:25:22] (03PS1) 10Dzahn: rsyslog: forward miscweb logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) [20:25:39] np! [20:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36357 and previous config saved to /var/cache/conftool/dbconfig/20221025-202538-ladsgroup.json [20:26:36] cjming: did it work? [20:27:06] (ConfdResourceFailed) firing: (48) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:27:15] urbanecm: yes! thanks - another Q tho - i think i need to purge a few of the files - can i run purgeList on a directory? 
[20:27:20] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849142|Update remaining Wikipedia logos (T319223)]] (duration: 06m 48s) [20:27:26] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:28:08] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [20:28:14] cjming: purgeList.php needs URIs to purge, unfortunately. [20:28:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:32] bummer - ok, i'll run each one [20:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36358 and previous config saved to /var/cache/conftool/dbconfig/20221025-202849-ladsgroup.json [20:28:57] PROBLEM - Host cp4034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:28:57] PROBLEM - Host cp4036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:29:04] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [20:29:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:26] cjming: you can do something like `ls /srv/mediawiki-staging/static/images/mobile/copyright/ | sed 's#^#https://en.wikipedia.org/static/images/mobile/copyright/#g' | mwscript purgeList.php` though [20:30:00] fancy - ok i'll try that [20:30:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:55] thanks urbanecm: works perfectly [20:32:00] great! 
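Note on the purge pipeline suggested at [20:29:26]: since purgeList.php takes URIs rather than a directory, the one-liner turns a directory listing into one URL per file by prepending a fixed prefix. A minimal sketch of that step as a reusable function follows. The function name `build_purge_urls` and its two arguments are hypothetical, not part of any Wikimedia tooling, and the sketch only prints the URLs so they can be inspected before being piped anywhere.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): print one URL per entry
# in a directory by prepending a URL prefix to each file name.
#   $1 = local directory to list
#   $2 = URL prefix, including the trailing slash
build_purge_urls() {
    dir=$1
    prefix=$2
    for path in "$dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -e "$path" ] || continue
        # ${path##*/} strips everything up to the last slash (the file name).
        printf '%s%s\n' "$prefix" "${path##*/}"
    done
}
```

As in the exchange above, the printed list could then be piped into `mwscript purgeList.php`, for example `build_purge_urls /srv/mediawiki-staging/static/images/mobile/copyright https://en.wikipedia.org/static/images/mobile/copyright/ | mwscript purgeList.php`. Printing first and piping second makes it easy to check the URLs before anything is purged.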
[20:34:24] PROBLEM - Host wcqs2002 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:16] RECOVERY - Host wcqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [20:36:09] hi cjming and Jdlrobson, there's a problem with zhwiki's tagline, actually we don't want to have it looks like this as our local consensus is not to change the tagline for 20 years celebration [20:36:56] and currently the tagline looks pretty small https://zh.wikipedia.org/wiki/?useskin=vector-2022 [20:37:26] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [20:39:21] Jdlrobson: do you have the previous file? we can revert that one per koi's note above [20:39:27] (03CR) 10Dzahn: "compiling on C:profile::rsyslog::kafka_shipper which is a lot of hosts" [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [20:39:53] cjming: I'm writing a patch now, please wait a sec [20:40:07] koi: great - standing by [20:40:27] Jdlrobson: nvm - i'll wait for koi's patch [20:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36359 and previous config saved to /var/cache/conftool/dbconfig/20221025-204045-ladsgroup.json [20:41:36] (03PS1) 10Stang: Revert tagline of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849171 [20:41:58] cjming: uploaded ^ [20:42:09] yup - on it [20:43:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849171 (owner: 10Stang) [20:43:13] cjming: hang on, I forgot one part [20:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P36360 and previous config saved to /var/cache/conftool/dbconfig/20221025-204356-ladsgroup.json [20:44:09] (03Merged) 10jenkins-bot: 
Revert tagline of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849171 (owner: 10Stang) [20:44:14] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:44:31] koi: whoops - i seem to be trigger-happy today -- can it be a follow up? [20:44:35] !log cjming@deploy1002 Started scap: Backport for [[gerrit:849171|Revert tagline of zhwiki]] [20:44:46] ok, trying [20:44:58] !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:849171|Revert tagline of zhwiki]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:45:31] koi: it's up on the debug servers ^^ if you want to verify [20:46:48] (03PS1) 10Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040) [20:47:04] PROBLEM - Host wcqs2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:44] (03PS1) 10Stang: Revert tagline of zhwiki (cont.) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849174 [20:48:05] cjming: posted another one, the continue of the first patch [20:48:19] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:48:54] RECOVERY - Host wcqs2003 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [20:48:54] ok - since your first revert already merged, i'll sync that one and do your 2nd patch - then purge it [20:48:59] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4050 [20:49:14] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4050 [20:49:18] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4052 [20:49:22] ok! 
[20:49:33] (03CR) 10Stef Dunlap: "Would you mind reviewing my patch or suggesting someone who might be able to review it?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 (owner: 10Stef Dunlap) [20:49:36] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4052 [20:50:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:50:40] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37748/" [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [20:51:54] (03PS1) 10Jdlrobson: WIP: Fix remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) [20:53:11] (03PS2) 10Andrew Bogott: wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:53:46] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849171|Revert tagline of zhwiki]] (duration: 09m 11s) [20:54:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849174 (owner: 10Stang) [20:54:44] (03Merged) 10jenkins-bot: Revert tagline of zhwiki (cont.) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849174 (owner: 10Stang) [20:55:09] !log cjming@deploy1002 Started scap: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]] [20:55:32] !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:55:35] koi: can you check on a debug server? 
[20:55:39] looking [20:55:41] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/849173 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36361 and previous config saved to /var/cache/conftool/dbconfig/20221025-205551-ladsgroup.json [20:55:54] cjming: LGTM [20:56:00] great - syncing [20:56:54] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:58:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:58:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:59:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321312)', diff saved to https://phabricator.wikimedia.org/P36362 and previous config saved to /var/cache/conftool/dbconfig/20221025-205902-ladsgroup.json [20:59:59] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:849174|Revert tagline of zhwiki (cont.)]] (duration: 04m 49s) [21:00:22] koi: should be live - purged your files [21:01:15] thanks! [21:01:19] np! 
[21:01:21] !log end of UTC late backport window [21:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:16] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10RobH) [21:03:10] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:03:51] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4033.ulsfo.wmnet [21:03:55] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [21:03:56] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4035.ulsfo.wmnet [21:04:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:05:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:35] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [21:06:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:08:11] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:08:23] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:09:45] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [21:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321312)', diff saved to https://phabricator.wikimedia.org/P36363 and previous config saved to /var/cache/conftool/dbconfig/20221025-211058-ladsgroup.json [21:11:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:11:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36364 and previous config saved to /var/cache/conftool/dbconfig/20221025-211125-ladsgroup.json [21:11:45] (JobUnavailable) firing: (6) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:12:01] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:12:02] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4035.ulsfo.wmnet [21:12:05] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4035.ulsfo.wmnet` - cp4035.ulsfo.wmnet (**FAIL**) - Downtimed host on Icinga/Alertmanager - Fo... 
[21:12:38] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:12:40] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4033.ulsfo.wmnet [21:12:44] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4033.ulsfo.wmnet` - cp4033.ulsfo.wmnet (**FAIL**) - Downtimed host on Icinga/Alertmanager - Fo... [21:15:32] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4038.mgmt.ulsfo.wmnet with reboot policy FORCED [21:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36365 and previous config saved to /var/cache/conftool/dbconfig/20221025-211730-ladsgroup.json [21:20:18] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:20:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:20:46] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:21:24] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [21:28:07] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [21:29:12] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4033.ulsfo.wmnet [21:31:45] (JobUnavailable) firing: (7) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36366 and previous 
config saved to /var/cache/conftool/dbconfig/20221025-213236-ladsgroup.json [21:34:28] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:34:54] (03PS1) 10BBlack: Clean up trafficserver::tls and related [puppet] - 10https://gerrit.wikimedia.org/r/849178 [21:34:56] (03PS1) 10BBlack: Remove cache::(text|upload)_envoy remnants [puppet] - 10https://gerrit.wikimedia.org/r/849179 [21:34:58] (03PS1) 10BBlack: Link/copy (text|upload)_haproxy to base roles [puppet] - 10https://gerrit.wikimedia.org/r/849180 [21:37:40] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:37:43] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:38:36] (03CR) 10BBlack: [C: 03+2] Clean up outdated commentary on requestctl [puppet] - 10https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [21:38:54] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:38:55] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4033.ulsfo.wmnet [21:38:59] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp4033.ulsfo.wmnet` - cp4033.ulsfo.wmnet (**FAIL**) - //Host not found on Icinga, unable to downti... 
[21:41:55] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:42:50] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4035.ulsfo.wmnet [21:43:06] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [21:43:20] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4044.mgmt.ulsfo.wmnet with reboot policy FORCED [21:45:45] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4009 [21:46:01] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4009 [21:46:43] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4010 [21:46:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [21:47:11] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4010 [21:47:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36367 and previous config saved to /var/cache/conftool/dbconfig/20221025-214743-ladsgroup.json [21:50:14] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:50:20] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:50:50] PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([lvs4010.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [21:51:36] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:51:37] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp4035.ulsfo.wmnet [21:51:40] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: 
`cp4035.ulsfo.wmnet` - cp4035.ulsfo.wmnet (**FAIL**) - //Host not found on Icinga, unable to downti... [21:53:44] PROBLEM - PyBal IPVS diff check on lvs4006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([lvs4009.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [21:57:39] (03CR) 10Dzahn: [C: 03+2] "thanks as well" [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [21:57:41] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4044.mgmt.ulsfo.wmnet with reboot policy FORCED [21:59:04] !log robh@cumin2002 START - Cookbook sre.dns.netbox [22:00:47] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4051 [22:01:02] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4051 [22:01:02] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:28] PROBLEM - PyBal IPVS diff check on lvs4007 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:01:45] (JobUnavailable) firing: (6) Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:02:18] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [22:02:23] (03PS1) 10BCornwall: WIP: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [22:02:42] (03CR) 10Cwhite: [C: 03+1] alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [22:02:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321312)', diff saved to https://phabricator.wikimedia.org/P36368 and previous config saved to /var/cache/conftool/dbconfig/20221025-220249-ladsgroup.json [22:03:44] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [22:04:17] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:04:53] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED [22:07:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10RobH) [22:08:36] (03CR) 10BCornwall: [C: 03+1] "Seems like the surrounding lines in the block are tab characters while, so this brings it in line with the expectation (for the block, at " [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [22:11:30] PROBLEM - PyBal IPVS diff check on lvs4006 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4033.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:11:46] (03CR) 10Dzahn: "just adding history here. once upon a time all files, including .erb templates, in the puppet repo had tab indentation. 
Then we switched t" [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [22:12:00] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS [22:14:32] PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:16:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [22:17:01] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:17:08] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED [22:17:41] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:17:44] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [22:18:12] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [22:18:16] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:18:46] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:19:45] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [22:20:17] (03CR) 10Dzahn: [C: 03+2] "I can find some logs on logstash by filtering for miscweb* host names, but they are only the puppet runs (puppet: unchanged), I don't see " [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [22:20:17] 
!log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [22:24:31] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:24:40] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [22:24:42] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [22:25:24] PROBLEM - PyBal IPVS diff check on lvs4007 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4033.ulsfo.wmnet, cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:25:25] 10SRE, 10ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (10RobH) 05Open→03Resolved [22:26:03] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [22:28:28] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4040 [22:28:30] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4040 [22:29:18] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:30:38] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:32:28] PROBLEM - PyBal IPVS diff check on lvs4005 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([cp4035.ulsfo.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:32:45] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4051.mgmt.ulsfo.wmnet with reboot policy FORCED [22:33:52] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [22:34:34] 
!log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED [22:34:51] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4050.mgmt.ulsfo.wmnet with reboot policy FORCED [22:35:07] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:35:56] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10BCornwall) Perhaps this is because the severity is set to warning rather than critical? [22:36:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:36:16] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [22:36:29] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4046.mgmt.ulsfo.wmnet with reboot policy FORCED [22:43:05] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [22:43:15] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:44:42] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4040.mgmt.ulsfo.wmnet with reboot policy FORCED [22:45:03] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4051.mgmt.ulsfo.wmnet with reboot policy FORCED [22:45:27] (03CR) 10Cwhite: [C: 03+1] rsyslog: forward miscweb logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [22:45:30] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:45:43] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) 
for host cp4048.mgmt.ulsfo.wmnet with reboot policy FORCED [22:48:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:51:37] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [22:51:56] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4042.mgmt.ulsfo.wmnet with reboot policy FORCED [23:07:02] (03PS1) 10Andrew Bogott: Dumps: remove a bunch of references to labstore1006 and labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/849192 (https://phabricator.wikimedia.org/T309346) [23:07:04] (03PS1) 10Andrew Bogott: rsync-via-primary.sh: replace labstore with clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) [23:07:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) [23:08:15] (03CR) 10Andrew Bogott: rsync-via-primary.sh: replace labstore with clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849193 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [23:10:16] (03CR) 10Dzahn: [C: 03+2] rsyslog: forward miscweb logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [23:12:32] (03CR) 10Dzahn: [C: 03+2] rsyslog: forward miscweb logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [23:13:28] (03CR) 10Dzahn: [C: 03+2] rsyslog: forward miscweb logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [23:17:02] (03CR) 10Dzahn: [C: 03+2] rsyslog: forward miscweb logs to kafka (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [23:22:45] (03CR) 10Dzahn: [C: 03+2] "I had to create a saved search first. Think I'm good for now. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [23:24:17] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops-collab: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10Dzahn) 05Open→03Resolved This is resolved. logs are now available here: https://logstash.wi... [23:43:46] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom task - https://phabricator.wikimedia.org/T321596 (10ssingh) [23:57:21] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail, 10Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @BTullis Yea, that is acceptable. It's still progress over managing the group members... [23:58:04] (03PS2) 10Ssingh: cp402[2468], cp403[0246]: decommission hosts as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244) [23:58:37] (03CR) 10CI reject: [V: 04-1] cp402[2468], cp403[0246]: decommission hosts as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/849141 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [23:59:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency