[00:01:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:02:40] SRE-swift-storage, MediaWiki-Uploading, Patch-For-Review, User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9565825 (Bawolff) Gerrit patch to detect the situation where... [00:02:46] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4002.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:02:48] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:02:57] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:03:10] SRE-swift-storage, UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610#9565833 (Bawolff) https://gerrit.wikimedia.org/r/1005632 may help with this.
[00:04:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:05:25] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:10:49] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:11:48] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:11:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:12:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T357189)', diff saved to https://phabricator.wikimedia.org/P57649 and previous config saved to /var/cache/conftool/dbconfig/20240222-001210-arnaudb.json [00:12:27] T357189: Drop iwl_prefix_from_title from iwlinks - 
https://phabricator.wikimedia.org/T357189 [00:12:36] SRE-swift-storage, UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610#9565857 (Bawolff) [00:12:58] SRE-swift-storage, MediaWiki-Uploading, Patch-For-Review, User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9565859 (Bawolff) [00:13:02] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:13:38] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:13:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:14:54] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433#9565873 (Bawolff) [00:14:58] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:15:04] SRE-swift-storage, MediaWiki-Uploading, Patch-For-Review, 
User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9565875 (Bawolff) [00:16:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:17:54] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:18:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:18:56] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:19:54] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:19:58] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on conf1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:21:49] (PuppetZeroResources) resolved: (5) Puppet has failed generate resources on mw1415:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:22:49] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1403:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:23:38] (JobUnavailable) firing: (2) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:23:38] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:25:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on parse1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:26:22] (PS1) Ebernhardson: cirrus: Add script to orchestrate reindexing [deployment-charts] - https://gerrit.wikimedia.org/r/1005635 (https://phabricator.wikimedia.org/T356303) [00:26:31] (ProbeDown) resolved: (2) Service 
gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:27:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1403:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:28:38] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:28:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:31:43] SRE, MW-on-K8s, Scap, serviceops, and 2 others: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9565937 (CodeReviewBot) thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219 Check bare metal an...
[00:32:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:33:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:34:02] (PS2) Ebernhardson: cirrus: Add script to orchestrate reindexing [deployment-charts] - https://gerrit.wikimedia.org/r/1005635 (https://phabricator.wikimedia.org/T356303) [00:37:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1403:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:38:19] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1403:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:38:26] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.18 ms [00:38:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:39:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005530 [00:39:17] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005530 (owner: TrainBranchBot) [00:40:10] SRE, 
Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020#9565952 (Legoktm) queue runner seems to have crashed, based on https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=now-2d&to=now {F42031241} trying to flag down a... [00:42:39] (PS1) Legoktm: Revert "admin: temporarily revoke legoktm's ssh key" [puppet] - https://gerrit.wikimedia.org/r/1005637 [00:43:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1391:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:43:48] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:43:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:43:51] !log rzl@lists1001:~$ sudo systemctl restart mailman3 [00:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on moss-be1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:47:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1421:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:48:23] SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: Not receiving posts or moderation messages - 
https://phabricator.wikimedia.org/T358020#9565958 (RLazarus) Open→Resolved a:RLazarus Restarted mailman3 at 00:43, icinga alerts are cleared, and the graph in T358020#9565952 is trending down ag... [00:48:48] SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020#9565961 (JJMC89) Looks the same as a previous #wikimedia-incident {T331626} [00:53:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1391:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:57:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1421:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:58:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:03:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1391:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:03:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005530 (owner: TrainBranchBot) [01:07:49] 
(PuppetZeroResources) resolved: (3) Puppet has failed generate resources on mw1421:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:08:19] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1391:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:18:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1371:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:23:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1371:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:25:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1428:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:25:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [01:35:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1428:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:40:33] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:43:48] (PuppetZeroResources) firing: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:48:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:58:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:58:33] (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:58:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:59:48] (PuppetZeroResources) firing: Puppet has failed generate resources on conf1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:59:49] (PuppetZeroResources) 
firing: Puppet has failed generate resources on mw1483:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:02:24] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [02:03:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:08:58] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.04 ms [02:09:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on conf1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:10:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:14:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1483:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:19:49] (PuppetZeroResources) firing: Puppet has failed generate resources on maps1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:20:33] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:20:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:22:19] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:27:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:30:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:34:48] (PuppetZeroResources) firing: Puppet has failed generate resources on registry1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:38:03] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1489:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:38:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1450:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:38:38] (JobUnavailable) firing: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:45:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:48:25] (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:49:51] (KubernetesAPINotScrapable) firing: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:50:41] (CR) Ssingh: "I may be completely wrong on this and it's late so apologies in advance: it seems like we have a bunch of Puppet failure after this change" [puppet] - https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: JHathaway) [02:51:31] (PS1) Ssingh: Revert "etcd: disable the diff output for client config with passwords" [puppet] - 
https://gerrit.wikimedia.org/r/1005482 [02:52:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1468:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:53:03] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1489:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:55:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1357:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:58:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1448:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:59:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on maps1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:02:32] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:02:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1468:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:06:22] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.194 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:07:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1467:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:08:18] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1448:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:13:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1413:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:13:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:20:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:22:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1467:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:33:49] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:34:48] 
(PuppetZeroResources) firing: Puppet has failed generate resources on mw1443:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:35:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:42:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1484:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:43:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:44:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on registry1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:48:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:49:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1443:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:53:48] (PuppetZeroResources) firing: Puppet has failed generate resources on maps1005:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:53:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:55:48] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:55:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:00:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:00:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:01:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.941 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:03:18] (PuppetZeroResources) firing: (2) 
Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:05:25] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:10:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:11:48] (PuppetZeroResources) firing: Puppet has failed generate resources on poolcounter1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:13:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:16:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:18:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1352:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:20:48] (PuppetZeroResources) firing: Puppet has failed generate resources on moss-be1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:21:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on poolcounter1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:23:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:28:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:30:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:31:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:32:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1484:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:33:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on maps1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:33:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1352:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:37:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:40:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1484:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:42:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:43:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:43:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1411:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:48:55] (SystemdUnitFailed) firing: (2) update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:50:45] (CR) JHathaway: [C: +2] Revert "etcd: disable the diff output for client config with passwords" [puppet] - https://gerrit.wikimedia.org/r/1005482 (owner: Ssingh) [04:52:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:52:34] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:55:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1485:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:55:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:57:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:57:48] (PuppetZeroResources) firing: Puppet has failed generate resources on maps1006:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:00:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on moss-be1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:03:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1411:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:06:06] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:06:54] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:07:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:10:48] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on parse1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:11:18] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1445:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:13:49] (PuppetZeroResources) firing: (3) 
Puppet has failed generate resources on mw1411:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:16:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:22:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:22:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:22:52] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:25:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [05:26:18] (PuppetZeroResources) resolved: (2) Puppet has failed 
generate resources on parse1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:33:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:35:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:40:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1443:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:40:22] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:40:26] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 135, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:41:48] (PuppetZeroResources) firing: Puppet has failed generate resources on irc2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:44:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:48:37] (SystemdUnitFailed) firing: (2) 
update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:49:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:49:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:50:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1443:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:50:33] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1443:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:51:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:51:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1445:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:51:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.100 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:51:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on irc2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:53:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:55:18] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:59:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.325s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:59:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:01:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1445:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:03:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:04:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:09:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.142s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:12:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.043s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:13:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:14:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:17:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:18:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:19:48] (PuppetZeroResources) firing: Puppet has failed generate resources on maps1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:20:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:23:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:24:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:28:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:29:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1356:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:30:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:33:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:33:50] (PuppetZeroResources) firing: Puppet has failed generate resources on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:37:11] (PS1) Marostegui: Revert "db2137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005483 [06:38:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:38:51] (CR) Marostegui: [C: +2] Revert "db2137: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005483 (owner: Marostegui) [06:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57650 and previous config saved to /var/cache/conftool/dbconfig/20240222-063923-root.json [06:42:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1030 as es2 master T358080', diff saved to https://phabricator.wikimedia.org/P57651 and previous config saved to /var/cache/conftool/dbconfig/20240222-064205-marostegui.json [06:42:11] T358080: Upgrade 
es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [06:42:26] (CR) Giuseppe Lavagetto: [C: +1] deployment_server: Add mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [06:42:47] (PS1) Marostegui: es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005649 (https://phabricator.wikimedia.org/T358080) [06:42:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1033 T358080', diff saved to https://phabricator.wikimedia.org/P57652 and previous config saved to /var/cache/conftool/dbconfig/20240222-064253-root.json [06:43:50] (PuppetZeroResources) resolved: Puppet has failed generate resources on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:44:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS bookworm [06:44:29] (CR) Marostegui: [C: +2] es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005649 (https://phabricator.wikimedia.org/T358080) (owner: Marostegui) [06:46:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1445:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:46:32] (PS1) Marostegui: clouddb1017: Migration to 10.6 [puppet] - https://gerrit.wikimedia.org/r/1005650 (https://phabricator.wikimedia.org/T356838) [06:46:44] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s3 [06:46:46] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s1 [06:47:33] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: 
name=clouddb1018.eqiad.wmnet,service=s1 [06:47:42] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s3 [06:48:07] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [06:48:10] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [06:48:19] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [06:48:25] (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:48:32] (03CR) 10Marostegui: "The host is already depooled:" [puppet] - 10https://gerrit.wikimedia.org/r/1005650 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [06:48:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:49:49] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:50:06] (KubernetesAPINotScrapable) firing: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:53:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1370:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:54:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57653 and previous config saved to /var/cache/conftool/dbconfig/20240222-065428-root.json [06:55:54] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Support one-off jobs [deployment-charts] - https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [06:57:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [06:58:01] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [06:58:40] (CR) Giuseppe Lavagetto: [C: +1] "Overall LGTM, couple comments on helmfile.yaml, but they're not in the way of merging this patch." [deployment-charts] - https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus) [06:58:48] (PuppetZeroResources) firing: Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:58:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T0700) [07:00:04] kormat, marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T0700). 
[07:03:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:04:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57654 and previous config saved to /var/cache/conftool/dbconfig/20240222-070933-root.json [07:09:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:11:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:12:20] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:14:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on parse1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:17:10] (03PS18) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [07:18:22] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [07:19:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1033.eqiad.wmnet with OS bookworm [07:20:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:23:10] (03PS19) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [07:23:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:24:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57655 and previous config saved to /var/cache/conftool/dbconfig/20240222-072438-root.json [07:24:48] (PuppetZeroResources) firing: (3) 
Puppet has failed generate resources on parse1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:25:03] (03PS1) 10Marostegui: Revert "es1033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005484 [07:25:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:25:54] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1005095 (https://phabricator.wikimedia.org/T357749) (owner: 10Muehlenhoff) [07:27:17] (03CR) 10Marostegui: [C: 03+2] Revert "es1033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005484 (owner: 10Marostegui) [07:27:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 1%: After migration', diff saved to https://phabricator.wikimedia.org/P57656 and previous config saved to /var/cache/conftool/dbconfig/20240222-072729-root.json [07:28:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:30:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2026 as es2 codfw master T358080', diff saved to https://phabricator.wikimedia.org/P57657 and previous config saved to /var/cache/conftool/dbconfig/20240222-073017-marostegui.json [07:30:18] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:30:25] 
T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [07:33:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:34:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on parse1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:35:19] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:35:34] (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on mw1400:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:35:54] (03PS1) 10Marostegui: wmnet: Promote es2026 to es2 master [dns] - 10https://gerrit.wikimedia.org/r/1005653 (https://phabricator.wikimedia.org/T358080) [07:38:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:39:40] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote es2026 to es2 master [dns] - 10https://gerrit.wikimedia.org/r/1005653 (https://phabricator.wikimedia.org/T358080) (owner: 10Marostegui) [07:39:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57658 and previous config saved to 
/var/cache/conftool/dbconfig/20240222-073943-root.json [07:39:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on parse1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:40:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2033 T358080', diff saved to https://phabricator.wikimedia.org/P57659 and previous config saved to /var/cache/conftool/dbconfig/20240222-074042-root.json [07:40:48] T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [07:42:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: After migration', diff saved to https://phabricator.wikimedia.org/P57660 and previous config saved to /var/cache/conftool/dbconfig/20240222-074233-root.json [07:43:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:48:07] (03PS1) 10Marostegui: s8-pager.sql: this is not needed anymore [software] - 10https://gerrit.wikimedia.org/r/1005681 [07:48:25] (03PS2) 10Marostegui: s8-pager.sql: this is not needed anymore [software] - 10https://gerrit.wikimedia.org/r/1005681 [07:49:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on parse1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:51:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1468:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:53:49] (PuppetZeroResources) firing: 
(4) Puppet has failed generate resources on mw1368:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:54:17] (03PS1) 10Marostegui: es2033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005682 [07:54:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57661 and previous config saved to /var/cache/conftool/dbconfig/20240222-075448-root.json [07:54:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2033.codfw.wmnet with OS bookworm [07:55:10] (03CR) 10Marostegui: [C: 03+2] s8-pager.sql: this is not needed anymore [software] - 10https://gerrit.wikimedia.org/r/1005681 (owner: 10Marostegui) [07:55:31] (03CR) 10Marostegui: [C: 03+2] es2033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005682 (owner: 10Marostegui) [07:55:42] (03Merged) 10jenkins-bot: s8-pager.sql: this is not needed anymore [software] - 10https://gerrit.wikimedia.org/r/1005681 (owner: 10Marostegui) [07:57:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: After migration', diff saved to https://phabricator.wikimedia.org/P57662 and previous config saved to /var/cache/conftool/dbconfig/20240222-075738-root.json [07:58:18] !log taavi@puppetmaster1002 ~ $ sudo systemctl restart apache2 # lots of 'Error 500 on SERVER: Server Error: undefined method `content' for nil:NilClass' in the logs, seems to have helped [07:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T0800) [08:00:04] hoo: A patch you scheduled for UTC morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:46] (03PS1) 10Hoo man: Migrate to virtual domain mapping [extensions/Cognate] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005485 (https://phabricator.wikimedia.org/T348526) [08:01:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1468:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:03:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy2002 using scap backport" [extensions/Cognate] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1005467 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:03:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy2002 using scap backport" [extensions/Cognate] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005485 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:03:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1397:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:04:19] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:04:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on maps1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:04:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on parse1006:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:05:31] (03Merged) 10jenkins-bot: Migrate to virtual domain mapping [extensions/Cognate] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1005467 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:05:38] (03Merged) 10jenkins-bot: Migrate to virtual domain mapping [extensions/Cognate] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005485 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [08:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:30] !log hoo@deploy2002 Started scap: Backport for [[gerrit:1005467|Migrate to virtual domain mapping (T348526)]], [[gerrit:1005485|Migrate to virtual domain mapping (T348526)]] [08:06:36] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:08:04] !log hoo@deploy2002 hoo: Backport for [[gerrit:1005467|Migrate to virtual domain mapping (T348526)]], [[gerrit:1005485|Migrate to virtual domain mapping (T348526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:19] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:12:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2033.codfw.wmnet with reason: host reimage [08:12:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: After migration', diff saved to https://phabricator.wikimedia.org/P57663 
and previous config saved to /var/cache/conftool/dbconfig/20240222-081243-root.json [08:13:03] !log hoo@deploy2002 hoo: Continuing with sync [08:14:37] (03CR) 10Majavah: [C: 03+1] clouddb1017: Migration to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005650 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:14:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on parse1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:14:51] (03CR) 10Marostegui: [C: 03+2] clouddb1017: Migration to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005650 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:14:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2033.codfw.wmnet with reason: host reimage [08:16:34] (03CR) 10ArielGlenn: "The full diff looks good to my eyes:" [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [08:16:55] (03PS1) 10Marostegui: Revert "es2033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005686 [08:19:19] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:19:59] (03PS1) 10Marostegui: clouddb1016: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005684 (https://phabricator.wikimedia.org/T356838) [08:20:33] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [08:20:38] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [08:20:54] (03CR) 10Marostegui: "Not yet depooled, will do it before 
merging" [puppet] - 10https://gerrit.wikimedia.org/r/1005684 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:21:15] !log hoo@deploy2002 Finished scap: Backport for [[gerrit:1005467|Migrate to virtual domain mapping (T348526)]], [[gerrit:1005485|Migrate to virtual domain mapping (T348526)]] (duration: 14m 44s) [08:21:20] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:23:44] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 138997 [08:24:17] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 138997 [08:24:19] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:24:26] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 138997 [08:24:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on parse1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:25:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138997 [08:26:05] (03CR) 10Majavah: [C: 03+1] clouddb1016: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005684 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:27:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:27:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 
(re)pooling @ 50%: After migration', diff saved to https://phabricator.wikimedia.org/P57664 and previous config saved to /var/cache/conftool/dbconfig/20240222-082750-root.json [08:28:25] (03PS1) 10Marostegui: Revert "wmnet: Promote es2026 to es2 master" [dns] - 10https://gerrit.wikimedia.org/r/1005687 [08:28:25] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 18779 [08:29:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 18779 [08:29:29] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Promote es2026 to es2 master" [dns] - 10https://gerrit.wikimedia.org/r/1005687 (owner: 10Marostegui) [08:30:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2033.codfw.wmnet with OS bookworm [08:30:53] (03CR) 10Marostegui: [C: 03+2] Revert "es2033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005686 (owner: 10Marostegui) [08:31:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57665 and previous config saved to /var/cache/conftool/dbconfig/20240222-083111-root.json [08:34:19] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:34:48] (PuppetZeroResources) firing: Puppet has failed generate resources on moss-be1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:42:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:42:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:42:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host puppetmaster1002.eqiad.wmnet [08:42:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:42:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:42:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57666 and previous config saved to /var/cache/conftool/dbconfig/20240222-084235-arnaudb.json [08:42:45] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:42:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: After migration', diff saved to https://phabricator.wikimedia.org/P57667 and previous config saved to /var/cache/conftool/dbconfig/20240222-084255-root.json [08:44:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet [08:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57668 and previous config saved to /var/cache/conftool/dbconfig/20240222-084616-root.json [08:52:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2143,2195].codfw.wmnet,db1187.eqiad.wmnet with reason: Silence for reboot T356240 [08:52:06] !log rolling out prometheus-rsyslog-exporter 1.0.0+git20221110-1 to wikikube nodes - T357616 [08:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:12] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 
[08:52:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2143,2195].codfw.wmnet,db1187.eqiad.wmnet with reason: Silence for reboot T356240 [08:53:23] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9566585 (10dcaro) [08:55:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 - depooling db1187 db2143 db2195', diff saved to https://phabricator.wikimedia.org/P57669 and previous config saved to /var/cache/conftool/dbconfig/20240222-085521-arnaudb.json [08:55:36] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1180.eqiad.wmnet [08:55:53] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2143.codfw.wmnet [08:56:06] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2195.codfw.wmnet [08:56:18] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611#9566586 (10hashar) That has been solved by setting PIP_FIN... 
[08:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: After migration', diff saved to https://phabricator.wikimedia.org/P57670 and previous config saved to /var/cache/conftool/dbconfig/20240222-085800-root.json [08:58:24] (03PS1) 10Muehlenhoff: Remove puppetmaster1002 from Puppet 5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1005708 [08:59:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable $wgLocalHTTPProxy on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004135 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [08:59:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1180.eqiad.wmnet [08:59:53] (03CR) 10Jelto: [C: 03+1] "we can try that for now until puppetmaster1002 is fixed/replaced, so lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1005708 (owner: 10Muehlenhoff) [08:59:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove puppetmaster1002 from Puppet 5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1005708 (owner: 10Muehlenhoff) [09:00:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2195.codfw.wmnet [09:01:10] (03CR) 10Muehlenhoff: [C: 03+2] Remove puppetmaster1002 from Puppet 5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1005708 (owner: 10Muehlenhoff) [09:01:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2143.codfw.wmnet [09:01:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57671 and previous config saved to /var/cache/conftool/dbconfig/20240222-090121-root.json [09:03:39] !log restart prometheus@k8s in eqiad - T343529 [09:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:44] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [09:03:50] 
(03CR) 10Brouberol: [C: 03+1] "The diff looks good. I trust you on the actual absented jobs." [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) (owner: 10Joal) [09:04:19] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:06:50] (03PS3) 10Ayounsi: users: add jwheeler to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1004187 (https://phabricator.wikimedia.org/T357731) (owner: 10Hnowlan) [09:07:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1494:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:07:49] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:08:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:09:19] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:09:51] (KubernetesAPINotScrapable) resolved: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [09:10:26] 
10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: Connection errors from puppetmaster1002 to puppetdb - https://phabricator.wikimedia.org/T358187#9566631 (10MoritzMuehlenhoff) [09:10:50] ^ we're seeing quite a few puppet errors at the time, with messages such as "Error 500 on SERVER: Server Error: Could not retrieve facts for xxx.xxx.wmnet: Failed to find facts from PuppetDB at puppet:8140: undefined method `content' for nil:NilClass" [09:11:08] Ah, I think this is related to the ticket m.oritz just created [09:12:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1494:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:13:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:14:19] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1372:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:14:45] (03PS2) 10Ayounsi: Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) [09:14:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on moss-be1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:15:36] (03CR) 10Ayounsi: Update brion to bvibber (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:16:13] (03CR) 10CI reject: [V: 04-1] 
Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:16:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: Connection errors from puppetmaster1002 to puppetdb - https://phabricator.wikimedia.org/T358187#9566643 (10Jelto) > A restart of Apache and a reboot of puppetmaster1002 did not help. This restarts //probably// had different effects. It seems the... [09:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57672 and previous config saved to /var/cache/conftool/dbconfig/20240222-091626-root.json [09:16:38] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1005021 (owner: 10Ayounsi) [09:17:35] (03PS1) 10Filippo Giunchedi: sre: move PuppetZeroResources to warning [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) [09:17:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1494:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:18:48] (03CR) 10Muehlenhoff: Update brion to bvibber (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:18:49] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on dragonfly-supernode1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:19:19] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:20:14] (03CR) 10Volans: [C: 03+1] "Sorry for the trouble. Interesting, I wonder why PCC didn't catch it, it was run on like 55 hosts with the resource..." [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [09:22:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:22:55] (03PS3) 10Ayounsi: Update brion to bvibber [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) [09:23:18] (03CR) 10Ayounsi: Update brion to bvibber (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:23:49] (03CR) 10Muehlenhoff: "Do you have a list of servers which failed? Might have been unrelated to this patch, but https://phabricator.wikimedia.org/T358187 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [09:25:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57673 and previous config saved to /var/cache/conftool/dbconfig/20240222-092503-arnaudb.json [09:25:54] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:26:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57674 and previous config saved to /var/cache/conftool/dbconfig/20240222-092609-arnaudb.json [09:29:19] (PuppetZeroResources) resolved: (5) Puppet has failed generate resources on mw1364:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:29:49] (03CR) 10Volans: "Adding my 2 cents to the discussion" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [09:31:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57675 and previous config saved to /var/cache/conftool/dbconfig/20240222-093130-root.json [09:33:10] (03CR) 10Muehlenhoff: "Looks good, but before merging let's wait for feedback on https://phabricator.wikimedia.org/T358044#9562598" [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:38:08] (03CR) 10MVernon: [C: 03+1] "Thanks, there was a 
lot of alert spam this morning!" [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [09:39:32] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move PuppetZeroResources to warning [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [09:39:36] (03CR) 10Jelto: "one note from todays incident: the total failure was around 7% globally (because only in eqiad and puppet 5). WidespreadPuppetFailure aler" [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [09:39:46] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] sre: move PuppetZeroResources to warning [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [09:40:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57677 and previous config saved to /var/cache/conftool/dbconfig/20240222-094008-arnaudb.json [09:41:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57678 and previous config saved to /var/cache/conftool/dbconfig/20240222-094114-arnaudb.json [09:42:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57679 and previous config saved to /var/cache/conftool/dbconfig/20240222-094257-arnaudb.json [09:43:04] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:46:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57680 and previous config saved to /var/cache/conftool/dbconfig/20240222-094635-root.json [09:47:24] 
SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q3): Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#9566703 (fgiunchedi) [09:48:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:07] (CR) Volans: "the idea looks ok, minor nits inline" [cookbooks] - https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney) [09:49:50] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q3): Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#9566716 (fgiunchedi) >>! In T357747#9562810, @MatthewVernon wrote: > I think the proposed table should look like this? > > | # weeks | GBs... [09:53:36] (CR) Muehlenhoff: "Not sure if this is really an adequate replacement? 
For the https://phabricator.wikimedia.org/T358187 incident I don't see a WidespreadPup" [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [09:55:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57681 and previous config saved to /var/cache/conftool/dbconfig/20240222-095513-arnaudb.json [09:56:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57682 and previous config saved to /var/cache/conftool/dbconfig/20240222-095619-arnaudb.json [09:58:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P57683 and previous config saved to /var/cache/conftool/dbconfig/20240222-095804-arnaudb.json [10:01:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57684 and previous config saved to /var/cache/conftool/dbconfig/20240222-100140-root.json [10:02:56] (03CR) 10Fabfur: [V: 03+1] haproxy: configure extended logging (preparatory for Benthos) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [10:10:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57685 and previous config saved to /var/cache/conftool/dbconfig/20240222-101018-arnaudb.json [10:11:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57686 and previous config saved to /var/cache/conftool/dbconfig/20240222-101123-arnaudb.json [10:13:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1167', diff saved to https://phabricator.wikimedia.org/P57687 and previous config saved to /var/cache/conftool/dbconfig/20240222-101310-arnaudb.json [10:18:58] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "Good point re: WidespreadPuppetFailure I have reopened to investigate https://phabricator.wikimedia.org/T357893" [alerts] - 10https://gerrit.wikimedia.org/r/1005712 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [10:26:37] (03PS1) 10JMeybohm: Don't restart rsyslog on updates, kill exporter instead [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) [10:26:53] (03PS2) 10Clément Goubert: Enable $wgLocalHTTPProxy on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004135 (https://phabricator.wikimedia.org/T298265) [10:28:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57688 and previous config saved to /var/cache/conftool/dbconfig/20240222-102817-arnaudb.json [10:28:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:28:23] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:28:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:28:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:29:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:29:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T357189)', diff saved to https://phabricator.wikimedia.org/P57689 and previous config saved to 
/var/cache/conftool/dbconfig/20240222-102906-arnaudb.json [10:29:42] (03CR) 10Muehlenhoff: [C: 03+2] Add puppetised java.security config file for hardened TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/1005095 (https://phabricator.wikimedia.org/T357749) (owner: 10Muehlenhoff) [10:30:50] (03PS1) 10EoghanGaffney: [apt/gitlab] Add new package for Gitlab update [puppet] - 10https://gerrit.wikimedia.org/r/1005721 (https://phabricator.wikimedia.org/T358182) [10:31:01] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5 [10:31:05] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8 [10:31:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005721 (https://phabricator.wikimedia.org/T358182) (owner: 10EoghanGaffney) [10:31:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T357189)', diff saved to https://phabricator.wikimedia.org/P57690 and previous config saved to /var/cache/conftool/dbconfig/20240222-103125-arnaudb.json [10:31:53] (03CR) 10Marostegui: [C: 03+2] clouddb1016: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005684 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [10:32:07] (03CR) 10EoghanGaffney: [C: 03+2] [apt/gitlab] Add new package for Gitlab update [puppet] - 10https://gerrit.wikimedia.org/r/1005721 (https://phabricator.wikimedia.org/T358182) (owner: 10EoghanGaffney) [10:35:31] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8 [10:35:34] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5 [10:38:38] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:43:20] (CR) Muehlenhoff: [C: +1] "Looks good (but this change also applies to the production IDPs let's disable Puppet on idp1002/2002 before rollout)." [puppet] - https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: Slyngshede) [10:43:38] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:46:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P57692 and previous config saved to /var/cache/conftool/dbconfig/20240222-104632-arnaudb.json [10:46:36] SRE, Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9566926 (saper) [10:47:39] SRE, MW-on-K8s, Scap, serviceops, Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9566949 (Clement_Goubert) We've talked this over, and while doing swagger checks made sense when there were just a few canaries o... 
[10:48:27] (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:41] SRE, Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9566956 (taavi) [10:50:54] SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020#9566958 (taavi) [10:52:40] SRE, Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9566926 (saper) p:Triage→Low [10:52:51] SRE, Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9566984 (saper) duplicate→Open [10:54:03] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057#9566992 (MoritzMuehlenhoff) [10:56:18] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057#9566993 (MoritzMuehlenhoff) Open→Resolved This is resolved [10:56:38] (PS1) Alexandros Kosiaris: ClusterConfig: Add kube-wiki-parsoid test [mediawiki-config] - https://gerrit.wikimedia.org/r/1005723 (https://phabricator.wikimedia.org/T357392) [11:00:04] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1100) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1100) [11:01:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P57693 and previous config saved to /var/cache/conftool/dbconfig/20240222-110138-arnaudb.json [11:03:46] (03CR) 10Volans: "I'm getting:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 (owner: 10Hashar) [11:09:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1028 T358180', diff saved to https://phabricator.wikimedia.org/P57694 and previous config saved to /var/cache/conftool/dbconfig/20240222-110914-root.json [11:09:20] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180 [11:09:47] (03PS1) 10Marostegui: es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005724 (https://phabricator.wikimedia.org/T358180) [11:12:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1028.eqiad.wmnet with OS bookworm [11:12:41] (03CR) 10Marostegui: [C: 03+2] es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005724 (https://phabricator.wikimedia.org/T358180) (owner: 10Marostegui) [11:16:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T357189)', diff saved to https://phabricator.wikimedia.org/P57695 and previous config saved to /var/cache/conftool/dbconfig/20240222-111644-arnaudb.json [11:16:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:16:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:17:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: 
Maintenance [11:17:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57696 and previous config saved to /var/cache/conftool/dbconfig/20240222-111706-arnaudb.json [11:19:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57697 and previous config saved to /var/cache/conftool/dbconfig/20240222-111925-arnaudb.json [11:26:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1028.eqiad.wmnet with reason: host reimage [11:29:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1028.eqiad.wmnet with reason: host reimage [11:34:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P57698 and previous config saved to /var/cache/conftool/dbconfig/20240222-113432-arnaudb.json [11:42:42] jouncebot: nowandnext [11:42:42] For the next 0 hour(s) and 17 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1100) [11:42:42] For the next 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1100) [11:42:42] In 1 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1300) [11:49:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P57699 and previous config saved to /var/cache/conftool/dbconfig/20240222-114938-arnaudb.json [11:50:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1028.eqiad.wmnet with OS bookworm [11:51:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/spark-history: apply [11:51:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [11:51:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics-privatedata-users for jwheeler - https://phabricator.wikimedia.org/T357731#9567233 (10hnowlan) 05Open→03Resolved a:03hnowlan Done [11:52:50] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: Connection errors from puppetmaster1002 to puppetdb - https://phabricator.wikimedia.org/T358187#9567253 (10cmooney) Definitely kind of strange. IP connectivity between these hosts is ok: `lines=15 cmooney@es1031:~$ ping puppetmaster1002 PING pup... [11:52:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [11:53:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [11:55:23] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrading gitlab [12:02:37] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrading gitlab [12:04:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T357189)', diff saved to https://phabricator.wikimedia.org/P57700 and previous config saved to /var/cache/conftool/dbconfig/20240222-120445-arnaudb.json [12:04:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [12:04:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:05:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [12:05:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T357189)', diff saved to 
https://phabricator.wikimedia.org/P57701 and previous config saved to /var/cache/conftool/dbconfig/20240222-120518-arnaudb.json [12:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T357189)', diff saved to https://phabricator.wikimedia.org/P57702 and previous config saved to /var/cache/conftool/dbconfig/20240222-120737-arnaudb.json [12:22:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P57703 and previous config saved to /var/cache/conftool/dbconfig/20240222-122244-arnaudb.json [12:27:44] (03CR) 10Volans: "immediate replies, I didn't checked yet the new PS" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [12:30:26] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567347 (10MoritzMuehlenhoff) @AndrewTavis_WMDE Thanks! This is a long task and to make things explicit: Is the summary below... 
[12:33:38] (PS3) Arturo Borrero Gonzalez: openstack: nova: compute: depend on the ceph config file being deployed [puppet] - https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) [12:33:53] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) (owner: Arturo Borrero Gonzalez) [12:37:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P57704 and previous config saved to /var/cache/conftool/dbconfig/20240222-123750-arnaudb.json [12:38:01] (PS1) Arturo Borrero Gonzalez: cloudvirt1034: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1005750 (https://phabricator.wikimedia.org/T319184) [12:39:17] (PS4) Arturo Borrero Gonzalez: openstack: nova: compute: depend on the ceph config file being deployed [puppet] - https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) [12:39:42] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) (owner: Arturo Borrero Gonzalez) [12:42:06] SRE, SRE-Access-Requests, Data-Platform-SRE, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567384 (AndrewTavis_WMDE) Thanks for checking in @MoritzMuehlenhoff! A correction to one of your points: - Membership in a... 
[12:43:55] (03CR) 10Marostegui: [C: 03+2] Revert "es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005691 (owner: 10Marostegui) [12:44:01] (03CR) 10Majavah: [C: 03+1] openstack: nova: compute: depend on the ceph config file being deployed [puppet] - 10https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) (owner: 10Arturo Borrero Gonzalez) [12:44:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57705 and previous config saved to /var/cache/conftool/dbconfig/20240222-124438-root.json [12:45:26] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrading gitlab [12:47:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: compute: depend on the ceph config file being deployed [puppet] - 10https://gerrit.wikimedia.org/r/1005733 (https://phabricator.wikimedia.org/T358101) (owner: 10Arturo Borrero Gonzalez) [12:52:32] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrading gitlab [12:52:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T357189)', diff saved to https://phabricator.wikimedia.org/P57706 and previous config saved to /var/cache/conftool/dbconfig/20240222-125257-arnaudb.json [12:52:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:53:04] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:53:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [12:53:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57707 and previous 
config saved to /var/cache/conftool/dbconfig/20240222-125319-arnaudb.json [12:55:07] (03CR) 10Volans: "Great! Final nits and it's ready!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [12:55:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57708 and previous config saved to /var/cache/conftool/dbconfig/20240222-125538-arnaudb.json [12:55:52] (03PS1) 10Filippo Giunchedi: sre: fix WidespreadPuppetFailure logic for no resources [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) [12:56:14] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567420 (10MoritzMuehlenhoff) >>! In T356279#9567384, @AndrewTavis_WMDE wrote: > Thanks for checking in @MoritzMuehlenhoff! A... [12:57:19] (03PS9) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [12:57:31] (03PS1) 10Muehlenhoff: Remove goransm from analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1005753 (https://phabricator.wikimedia.org/T356279) [12:58:29] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review, 10User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567427 (10AndrewTavis_WMDE) Thank you for the help with this, @MoritzMuehlenhoff! Please also add in @M... 
[12:59:42] (03PS2) 10Arturo Borrero Gonzalez: cloudvirt1034: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005750 (https://phabricator.wikimedia.org/T319184) [12:59:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57709 and previous config saved to /var/cache/conftool/dbconfig/20240222-125943-root.json [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1300) [13:00:21] (03CR) 10Majavah: [C: 03+1] cloudvirt1034: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005750 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:00:35] (03CR) 10Brouberol: [C: 03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:01:34] (03CR) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [13:01:40] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1034.eqiad.wmnet with OS bookworm [13:01:51] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567450 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1034.eqiad.wmnet with OS... 
[13:02:11] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading gitlab [13:02:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1034: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005750 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:03:01] !log ms-eqiad set ACL {"read-only":["mw:backup"]} T269108 [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:06] T269108: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 [13:03:25] (03PS1) 10Muehlenhoff: Remove cumin1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1005755 (https://phabricator.wikimedia.org/T353419) [13:03:25] (SystemdUnitFailed) resolved: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:43] (03CR) 10Filippo Giunchedi: "I see what you are getting at here, and in general I agree if we could get less disruptive upgrades that would be great. 
Though tbh it see" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:05:02] !log ms-codfw set ACL {"read-only":["mw:backup"]} T269108 [13:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:08] (03PS16) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [13:05:20] (03CR) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:06:36] 10SRE, 10MW-on-K8s, 10RESTBase, 10serviceops: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9567469 (10Clement_Goubert) [13:06:48] (03PS2) 10Filippo Giunchedi: thanos: fix bucket-query tools import [puppet] - 10https://gerrit.wikimedia.org/r/1005442 (https://phabricator.wikimedia.org/T351927) [13:07:50] 10SRE, 10MW-on-K8s, 10RESTBase, 10serviceops: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9567482 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [13:08:00] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9567484 (10Clement_Goubert) [13:10:22] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: provision thanos-downsample datasources [puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [13:10:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P57710 and previous config saved to /var/cache/conftool/dbconfig/20240222-131045-arnaudb.json [13:12:43] jouncebot: next [13:12:43] In 0 hour(s) and 47 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1400) [13:13:04] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9567499 (10hnowlan) [13:13:11] !log bounce grafana to apply new datasources [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:30] (03CR) 10Volans: "Nice! Almost ready" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:13:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove goransm from analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1005753 (https://phabricator.wikimedia.org/T356279) (owner: 10Muehlenhoff) [13:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57711 and previous config saved to /var/cache/conftool/dbconfig/20240222-131448-root.json [13:16:38] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review, 10User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567502 (10MoritzMuehlenhoff) 05Stalled→03Open a:03MoritzMuehlenhoff [13:16:44] (03CR) 10Btullis: [C: 03+2] Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:17:11] (03CR) 10Volans: [C: 03+1] "Ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:17:33] (03Merged) 10jenkins-bot: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:17:52] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108#9567508 (10MatthewVernon) @jcrespo can you try now, please? I constructed the appropriate URL thus: ` matthew@tsk:~/puppet$ python3 Python 3.9.2 (default, Fe... [13:18:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:18:18] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [13:18:57] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: fix bucket-query tools import [puppet] - 10https://gerrit.wikimedia.org/r/1005442 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:19:11] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? 
- https://phabricator.wikimedia.org/T353523#9567511 (10Volans) a:03Volans [13:20:20] (03PS2) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [13:20:50] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [13:20:58] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:21:02] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:21:06] (03CR) 10Filippo Giunchedi: [C: 03+1] C:puppetmaster::monitoring Disable Icinga merge check. [puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:23:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:23:38] (03PS3) 10Hashar: Change build image user from root to nobody [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 [13:25:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P57712 and previous config saved to /var/cache/conftool/dbconfig/20240222-132551-arnaudb.json [13:25:56] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:28:12] (03PS2) 10Slyngshede: C:puppetmaster::monitoring Disable Icinga merge check. [puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) [13:29:29] (03PS1) 10Filippo Giunchedi: sre: deploy pki alerts to eqiad/codfw only [alerts] - 10https://gerrit.wikimedia.org/r/1005758 (https://phabricator.wikimedia.org/T354255) [13:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57713 and previous config saved to /var/cache/conftool/dbconfig/20240222-132953-root.json [13:30:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1430/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:30:42] (03PS1) 10Btullis: Update the image tags for superset to reflect the actual tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005759 (https://phabricator.wikimedia.org/T357890) [13:31:24] (03CR) 10CI reject: [V: 04-1] C:puppetmaster::monitoring Disable Icinga merge check. 
[puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:31:38] (03CR) 10Brouberol: [C: 03+1] Update the image tags for superset to reflect the actual tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005759 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:32:26] (03CR) 10Btullis: [C: 03+2] Update the image tags for superset to reflect the actual tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005759 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:32:32] We'll be restarting gitlab for an update in approximately 1 hour. This should last less than 5 minutes. [13:33:43] (03Merged) 10jenkins-bot: Update the image tags for superset to reflect the actual tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005759 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:34:05] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:38:45] (03PS1) 10Btullis: Fix errant bracket in nginx configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005761 (https://phabricator.wikimedia.org/T357890) [13:39:55] (03CR) 10Btullis: [C: 03+2] Fix errant bracket in nginx configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005761 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:40:17] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1034 [13:40:35] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1034 [13:41:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57714 and previous config saved to /var/cache/conftool/dbconfig/20240222-134059-arnaudb.json [13:41:00] (03Merged) 10jenkins-bot: Fix errant bracket in nginx configmap [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1005761 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:41:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [13:41:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:41:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [13:41:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:41:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57715 and previous config saved to /var/cache/conftool/dbconfig/20240222-134120-arnaudb.json [13:42:23] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567605 (10aborrero) [13:43:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57716 and previous config saved to /var/cache/conftool/dbconfig/20240222-134340-arnaudb.json [13:44:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:44:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57717 and previous config saved to /var/cache/conftool/dbconfig/20240222-134458-root.json [13:45:10] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:45:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:46:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply 
[13:46:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1034.eqiad.wmnet with OS bookworm [13:47:10] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1034.eqiad.wmnet with OS book... [13:48:05] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108#9567633 (10jcrespo) Thank you a lot, as I mentioned in private, I will try to run the automatic downloads back again with the new user, if it works we will be... [13:48:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:38] (03PS1) 10Btullis: Bump superset chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005762 (https://phabricator.wikimedia.org/T357890) [13:48:55] (03PS6) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - 10https://gerrit.wikimedia.org/r/1004672 [13:49:06] (03CR) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [13:49:49] (03CR) 10Btullis: [C: 03+2] Bump superset chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005762 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:50:42] (03Merged) 10jenkins-bot: Bump superset chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005762 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [13:51:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:51:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:51:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:51:31] (03PS10) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [13:51:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:51:58] (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:52:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:52:17] (03CR) 10CI reject: [V: 04-1] C:prometheus::process_exporter Add a simplistic process exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [13:52:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:52:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:52:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:52:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:52:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:53:09] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:54:39] 10SRE, 10Infrastructure-Foundations, 10netops: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9567655 (10cmooney) p:05Triage→03Low [13:55:15] 10SRE, 10Infrastructure-Foundations, 10netops: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9567676 (10cmooney) [13:55:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9567677 (10cmooney) [13:57:51] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: drop version matrix split [puppet] - 10https://gerrit.wikimedia.org/r/1005763 [13:57:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: extend dependency on ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) [13:57:57] (03CR) 10CI reject: [V: 04-1] Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:58:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to 
https://phabricator.wikimedia.org/P57718 and previous config saved to /var/cache/conftool/dbconfig/20240222-135846-arnaudb.json [13:58:55] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) (owner: 10Arturo Borrero Gonzalez) [13:59:03] (03CR) 10CI reject: [V: 04-1] openstack: nova: compute: drop version matrix split [puppet] - 10https://gerrit.wikimedia.org/r/1005763 (owner: 10Arturo Borrero Gonzalez) [13:59:14] (03CR) 10CI reject: [V: 04-1] openstack: nova: compute: extend dependency on ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) (owner: 10Arturo Borrero Gonzalez) [14:00:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57719 and previous config saved to /var/cache/conftool/dbconfig/20240222-140003-root.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1400) [14:00:04] claime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:26] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424#9567694 (10RobH) [14:00:34] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: compute: drop version matrix split [puppet] - 10https://gerrit.wikimedia.org/r/1005763 [14:00:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: compute: extend dependency on ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) [14:01:08] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [14:01:10] I’m in a meeting but can deploy later if nobody else is around [14:01:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1005730 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:02:06] 10SRE, 10ops-codfw, 10ops-eqiad: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424#9567706 (10RobH) 05Open→03Stalled Both disconnects are currently pending with the vendors. EQ's has a ticket submitted directly where CyrusOne is via our account reps. Updates to bot... [14:02:59] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) (owner: 10Arturo Borrero Gonzalez) [14:03:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [14:04:34] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1005731 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [14:04:44] (03CR) 10Volans: "LGTM, one detail/question" [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:05:19] claime: will you self-deploy or do you need someone else to do that? 
[14:11:20] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419#9567739 (10Volans) [14:13:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P57720 and previous config saved to /var/cache/conftool/dbconfig/20240222-141353-arnaudb.json [14:14:06] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 93 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:14:14] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? - https://phabricator.wikimedia.org/T353523#9567737 (10Volans) 05Open→03Resolved I've copied the logs in `/var/log/{cumin,debdeploy,spicerack}` on `cumin1001` to `/var/log/cumin1001` on `cumin1002` and `cumin2002` usin... 
[14:15:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57721 and previous config saved to /var/cache/conftool/dbconfig/20240222-141508-root.json [14:17:56] (03PS1) 10Volans: validators: dcim.device fix asset tag regex [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1005768 (https://phabricator.wikimedia.org/T356633) [14:18:56] 10ops-eqiad, 10DC-Ops: asset tag typos - audit and correct - https://phabricator.wikimedia.org/T358223#9567756 (10RobH) p:05Triage→03Medium [14:19:03] 10ops-eqiad, 10DC-Ops: asset tag typos - audit and correct - https://phabricator.wikimedia.org/T358223#9567756 (10RobH) [14:19:06] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 33 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:19:10] 10ops-eqiad, 10DC-Ops: asset tag typos - audit and correct - https://phabricator.wikimedia.org/T358223#9567756 (10RobH) [14:19:57] (03PS1) 10Arturo Borrero Gonzalez: profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 [14:20:38] (03PS2) 10Arturo Borrero Gonzalez: profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 [14:22:00] (03PS1) 10Alexandros Kosiaris: mw-parsoid: Add parsoid.discovery.wmnet in cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005770 (https://phabricator.wikimedia.org/T357392) [14:24:24] (03CR) 10CI reject: [V: 04-1] profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 (owner: 10Arturo Borrero Gonzalez) [14:25:18] taavi: I'll self deploy [14:25:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet 
(Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9567796 (10MoritzMuehlenhoff) [14:26:28] (03PS3) 10Arturo Borrero Gonzalez: profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 [14:27:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004135 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:27:44] (03Merged) 10jenkins-bot: Enable $wgLocalHTTPProxy on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004135 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:27:54] (sorry I got called on the phone...) [14:28:09] !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:1004135|Enable $wgLocalHTTPProxy on group1 wikis (T298265)]] [14:28:15] T298265: Have internal MediaWiki to MediaWiki HTTP requests use an envoyproxy on appservers - https://phabricator.wikimedia.org/T298265 [14:28:19] (03PS4) 10Arturo Borrero Gonzalez: profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 [14:29:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57722 and previous config saved to /var/cache/conftool/dbconfig/20240222-142859-arnaudb.json [14:29:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [14:29:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:29:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [14:29:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T357189)', diff saved to 
https://phabricator.wikimedia.org/P57723 and previous config saved to /var/cache/conftool/dbconfig/20240222-142921-arnaudb.json [14:29:41] (03CR) 10CI reject: [V: 04-1] profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 (owner: 10Arturo Borrero Gonzalez) [14:29:43] !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:1004135|Enable $wgLocalHTTPProxy on group1 wikis (T298265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:03] (03PS5) 10Arturo Borrero Gonzalez: profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 [14:30:30] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9567826 (10MoritzMuehlenhoff) [14:31:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T357189)', diff saved to https://phabricator.wikimedia.org/P57724 and previous config saved to /var/cache/conftool/dbconfig/20240222-143141-arnaudb.json [14:31:47] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005769 (owner: 10Arturo Borrero Gonzalez) [14:34:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] profile_openstack_base_nova_compute_service_spec: unbreak tests [puppet] - 10https://gerrit.wikimedia.org/r/1005769 (owner: 10Arturo Borrero Gonzalez) [14:37:40] !log cgoubert@deploy2002 cgoubert: Continuing with sync [14:40:50] 10SRE-swift-storage, 10MediaWiki-Uploading, 10Patch-For-Review, 10User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9567876 (10MatthewVernon) Here's the relevant logs, sorted by t... 
[14:41:08] (03PS2) 10Alexandros Kosiaris: mw-parsoid: Add parsoid.discovery.wmnet in cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005770 (https://phabricator.wikimedia.org/T357392) [14:41:44] (03CR) 10Muehlenhoff: [C: 03+2] Make apt2002 a repository server [puppet] - 10https://gerrit.wikimedia.org/r/1005731 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [14:44:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bullseye [14:44:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9567887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye [14:44:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-redacteddb1001.eqiad.wmnet with OS bullseye [14:44:24] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9567888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye executed with errors: - an-redacteddb1001 (... 
[14:45:56] !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:1004135|Enable $wgLocalHTTPProxy on group1 wikis (T298265)]] (duration: 17m 46s) [14:46:02] T298265: Have internal MediaWiki to MediaWiki HTTP requests use an envoyproxy on appservers - https://phabricator.wikimedia.org/T298265 [14:46:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P57725 and previous config saved to /var/cache/conftool/dbconfig/20240222-144648-arnaudb.json [14:47:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove useless fixtures from services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002917 (owner: 10Giuseppe Lavagetto) [14:47:37] (03PS1) 10Cathal Mooney: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - 10https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) [14:48:39] (03PS2) 10Cathal Mooney: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - 10https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) [14:49:52] (03Merged) 10jenkins-bot: Remove useless fixtures from services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002917 (owner: 10Giuseppe Lavagetto) [14:50:07] (03CR) 10JMeybohm: "To be fair: The current solution also does not work with multiple instances." [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [14:50:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-parsoid: Add parsoid.discovery.wmnet in cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005770 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:50:52] (03CR) 10JMeybohm: "Like without the typo... 
😊" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [14:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 1%: After recloning', diff saved to https://phabricator.wikimedia.org/P57726 and previous config saved to /var/cache/conftool/dbconfig/20240222-145120-root.json [14:51:25] (03PS3) 10Cathal Mooney: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - 10https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) [14:51:46] (03Merged) 10jenkins-bot: mw-parsoid: Add parsoid.discovery.wmnet in cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005770 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [14:54:25] (03PS4) 10Cathal Mooney: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - 10https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) [14:54:32] (03CR) 10Ssingh: "Looking at my history, parse1017, parse1018, clouddumps1002. In some of these, you can see the task you mentioned above but some just had " [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [14:55:08] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:55:18] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:55:35] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [14:55:42] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:57:32] (03CR) 10Muehlenhoff: "Thanks! clouddumps1002 uses Puppet 7, so this confirms that the issue was in fact unrelated to https://phabricator.wikimedia.org/T358187." 
[puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [14:59:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: Connection errors from puppetmaster1002 to puppetdb - https://phabricator.wikimedia.org/T358187#9567991 (10ssingh) When I looked at this last night as the alerts were coming in, I noticed that some hosts were not reporting the connection failure b... [15:01:06] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading gitlab [15:01:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P57727 and previous config saved to /var/cache/conftool/dbconfig/20240222-150154-arnaudb.json [15:03:38] (03PS1) 10Hnowlan: kubernetes: move all remaining eligible jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005776 (https://phabricator.wikimedia.org/T354791) [15:04:56] (03PS2) 10Hnowlan: kubernetes: move all remaining eligible eqiad jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005776 (https://phabricator.wikimedia.org/T354791) [15:06:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P57728 and previous config saved to /var/cache/conftool/dbconfig/20240222-150626-root.json [15:11:07] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9568026 (10akosiaris) [15:11:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] etherpad: make exporter and blackbox checks configurable [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:12:45] !log Bump weight of old parsoid hosts from 10 to 110. 
This is a noop right now but will make calculations later spelled out in T357392 possible. [15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] !log akosiaris@cumin1002 conftool action : set/weight=110; selector: service=parsoid-php,name=(pars.*|mw.*) [15:12:51] T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392 [15:13:21] !log akosiaris@cumin1002 conftool action : set/weight=1; selector: service=parsoid-php,name=kubernetes.* [15:13:52] (03CR) 10Filippo Giunchedi: [C: 04-1] "The current solution would work with multiple instances because typically they are partof/bindsto (or the equivalent) to rsyslog.service, " [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:14:22] (03CR) 10Ayounsi: [C: 03+1] validators: dcim.device fix asset tag regex [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1005768 (https://phabricator.wikimedia.org/T356633) (owner: 10Volans) [15:14:49] (03PS11) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [15:15:11] !log T357392 pool 46 kubernetes hosts of parsoid-php with a weight of 1. 
Since the 42 parse hosts are at weight 110, that means 1% goes to mw-parsoid deployment, aka mw-on-k8s [15:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:18] (03CR) 10Volans: [C: 03+2] validators: dcim.device fix asset tag regex [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1005768 (https://phabricator.wikimedia.org/T356633) (owner: 10Volans) [15:15:32] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=parsoid-php,name=kubernetes.* [15:15:54] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye [15:17:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T357189)', diff saved to https://phabricator.wikimedia.org/P57729 and previous config saved to /var/cache/conftool/dbconfig/20240222-151701-arnaudb.json [15:17:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [15:17:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:17:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [15:17:28] (03Merged) 10jenkins-bot: validators: dcim.device fix asset tag regex [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1005768 (https://phabricator.wikimedia.org/T356633) (owner: 10Volans) [15:17:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T357189)', diff saved to https://phabricator.wikimedia.org/P57730 and previous config saved to /var/cache/conftool/dbconfig/20240222-151733-arnaudb.json [15:17:50] (03CR) 10Jelto: [V: 03+1 C: 03+2] etherpad: make exporter and blackbox checks configurable [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:18:10] PROBLEM - PyBal backends health check on lvs1019 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:18:38] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:18:50] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers kubernetes2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:18:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers kubernetes2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:19:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T357189)', diff saved to https://phabricator.wikimedia.org/P57731 and previous config saved to /var/cache/conftool/dbconfig/20240222-151952-arnaudb.json [15:20:59] (03CR) 10Ebernhardson: "dc" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005635 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [15:21:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P57732 and previous config saved to /var/cache/conftool/dbconfig/20240222-152131-root.json [15:25:14] (03CR) 10JHathaway: [C: 03+2] "I don't see any past puppet failures on clouddumps1002, so I this issue was caused by https://phabricator.wikimedia.org/T358187, not by th" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [15:25:33] (03PS1) 10Hnowlan: kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) [15:26:00] (03CR) 
10JHathaway: [C: 04-1] "I don't think this is needed, as the cause was, https://phabricator.wikimedia.org/T358187" [puppet] - 10https://gerrit.wikimedia.org/r/1005482 (owner: 10Ssingh) [15:26:51] (03CR) 10CI reject: [V: 04-1] kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:27:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [15:27:23] !log installing glib2.0 security updates on bullseye [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:26] (03CR) 10Ssingh: "You are right, sorry, it was https://puppetboard.wikimedia.org/report/cloudcephosd1029.eqiad.wmnet/bccf677965e65173ff9d2b307ffd5f69785c0cb" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [15:28:44] (03Abandoned) 10Ssingh: Revert "etcd: disable the diff output for client config with passwords" [puppet] - 10https://gerrit.wikimedia.org/r/1005482 (owner: 10Ssingh) [15:28:48] (03CR) 10Volans: [C: 03+1] "Ship it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:30:04] (03CR) 10JHathaway: [C: 03+2] "no problem at all, the error message was definitely confusing!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [15:32:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [15:33:14] (03CR) 10Ayounsi: [C: 03+2] Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:33:38] (03PS2) 10Hnowlan: kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) [15:34:43] (03CR) 10Herron: [C: 03+1] sre: fix WidespreadPuppetFailure logic for no resources [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [15:35:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P57733 and previous config saved to /var/cache/conftool/dbconfig/20240222-153459-arnaudb.json [15:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P57734 and previous config saved to /var/cache/conftool/dbconfig/20240222-153636-root.json [15:37:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9568184 (10cmooney) [15:38:41] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9568181 (10cmooney) 05Open→03Resolved a:03cmooney Closing this, thanks all for the help! 
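Note on the 15:15:11 entry: the "1%" figure follows from the pooled weights quoted in that message (46 kubernetes hosts at weight 1, 42 parse hosts at weight 110). A minimal sketch of the arithmetic, assuming traffic is split strictly in proportion to total pooled weight and that every host counted is healthy and pooled; the function name `traffic_share` is hypothetical and not part of conftool or PyBal.

```python
def traffic_share(group_hosts: int, group_weight: int,
                  other_hosts: int, other_weight: int) -> float:
    """Fraction of traffic reaching one group of backends.

    Assumption: requests are distributed in proportion to the sum of
    weights, so a group's share is its total weight divided by the
    total weight of all pooled backends.
    """
    group_total = group_hosts * group_weight
    other_total = other_hosts * other_weight
    return group_total / (group_total + other_total)


# Figures taken from the log entry: 46 kubernetes hosts at weight 1,
# 42 parse hosts at weight 110.
share = traffic_share(46, 1, 42, 110)
print(f"{share:.2%}")  # 46 / 4666, prints 0.99%
```

Under the same assumption, raising the kubernetes weight later changes only the numerator term, which is presumably why the old hosts were first bumped from 10 to 110.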
[15:39:08] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster [15:39:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9568189 (10MoritzMuehlenhoff) [15:39:25] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster (duration: 00m 16s) [15:40:20] (03Merged) 10jenkins-bot: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:40:36] (03CR) 10Jelto: [C: 03+1] "lgtm, but the expression for missing resource fires on 3% and agent failures on 10%. I can not tell if that's intended and agent failures " [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [15:42:29] (03PS1) 10Bking: rdf-streaming-updater: raise storage alert threshold [alerts] - 10https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) [15:43:40] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: raise storage alert threshold [alerts] - 10https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [15:44:59] (03PS2) 10Bking: rdf-streaming-updater: raise storage alert threshold [alerts] - 10https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) [15:46:06] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: raise storage alert threshold [alerts] - 10https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [15:46:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2031-2032].codfw.wmnet with reason: T355868 [15:46:22] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: move all remaining eligible eqiad jobrunners to k8s (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1005776 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:46:25] T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 [15:46:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2031-2032].codfw.wmnet with reason: T355868 [15:48:11] (03CR) 10JMeybohm: "agreed, thanks anyways" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:48:16] (03Abandoned) 10JMeybohm: Don't restart rsyslog on updates, kill exporter instead [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005718 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:48:19] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b2-codfw.mgmt with reason: prepping for server uplink migration codfw rack b2 [15:48:38] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b2-codfw.mgmt with reason: prepping for server uplink migration codfw rack b2 [15:48:44] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568216 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93a3c441-2097-4840-a202-5694f260c1b5... 
[15:50:04] (03CR) 10Clément Goubert: [C: 04-1] kubernetes: move all remaining eligible codfw jobrunners to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:50:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P57735 and previous config saved to /var/cache/conftool/dbconfig/20240222-155005-arnaudb.json [15:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P57736 and previous config saved to /var/cache/conftool/dbconfig/20240222-155141-root.json [15:53:37] !log depool thanos-fe2002 T355868 [15:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:43] T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 [15:54:02] !log depool codfs-mw T355868 [15:54:05] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:32] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B2 to lsw1-b2-codfw [15:55:43] (03PS3) 10Hnowlan: kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) [15:56:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 25 hosts with reason: Migrating servers in codfw rack B2 to lsw1-b2-codfw [15:56:13] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568300 (10ops-monitoring-bot) Icinga downtime and 
Alertmanager silence (ID=90864fe1-6d91-45db-a2a5-2bb22463c114... [15:56:37] (03CR) 10Hnowlan: kubernetes: move all remaining eligible codfw jobrunners to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:57:20] !log depooling mw[1458,1467-1468,1483-1485,1494].eqiad.wmnet in advance of reimaging [15:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:54] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host testvm2002.codfw.wmnet with OS bullseye [16:00:21] !log Commencing network maintenance migrating servers to new switch codfw rack B2 T355868 [16:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:42] T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 [16:04:19] !log volans@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [16:05:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T357189)', diff saved to https://phabricator.wikimedia.org/P57737 and previous config saved to /var/cache/conftool/dbconfig/20240222-160512-arnaudb.json [16:05:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [16:05:17] !log volans@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1001.eqiad.wmnet [16:05:18] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:05:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [16:05:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T357189)', diff saved to https://phabricator.wikimedia.org/P57738 and 
previous config saved to /var/cache/conftool/dbconfig/20240222-160534-arnaudb.json [16:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P57739 and previous config saved to /var/cache/conftool/dbconfig/20240222-160646-root.json [16:07:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T357189)', diff saved to https://phabricator.wikimedia.org/P57740 and previous config saved to /var/cache/conftool/dbconfig/20240222-160753-arnaudb.json [16:08:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568400 (10cmooney) All hosts moved successfully and back responding to pings. 
[16:09:01] 10SRE, 10Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9568404 (10matmarex) [16:09:04] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020#9568405 (10matmarex) [16:09:35] 10SRE, 10Wikimedia-Mailing-lists: Message-ID: delayed by 3 days - https://phabricator.wikimedia.org/T358198#9566926 (10matmarex) (presumably reopened by accident because Phabricator doesn't detect edit conflicts) [16:10:24] !log repool thanos-fe2002 T355868 [16:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:30] T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 [16:10:56] jouncebot nowandnext [16:10:57] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [16:10:57] In 0 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1700) [16:11:02] !log repool codfs-mw T355868 [16:11:07] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [16:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:45] sigh, at least my typos were consistent 🤦 [16:12:58] (03PS2) 10Filippo Giunchedi: sre: fix WidespreadPuppetFailure logic for no resources [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) [16:13:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568428 (10MatthewVernon) Swift is back OK, thanks. 
[16:13:12] (03CR) 10Filippo Giunchedi: "Good point, I've moved both to 3% instead" [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [16:16:09] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [16:18:43] (03PS3) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [16:19:35] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye [16:21:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P57741 and previous config saved to /var/cache/conftool/dbconfig/20240222-162151-root.json [16:22:08] (03PS1) 10Filippo Giunchedi: jaeger: bump collector and query resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005794 (https://phabricator.wikimedia.org/T358152) [16:22:41] (03CR) 10CI reject: [V: 04-1] WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [16:22:50] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [16:23:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P57742 and previous config saved to /var/cache/conftool/dbconfig/20240222-162300-arnaudb.json [16:25:28] !log dancy@deploy2002 Installing scap version "4.66.0" for 458 hosts [16:25:31] (03CR) 10CDanis: [C: 03+1] jaeger: bump collector and query resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005794 (https://phabricator.wikimedia.org/T358152) (owner: 10Filippo Giunchedi) [16:26:26] !log dancy@deploy2002 
Installation of scap version "4.66.0" completed for 458 hosts [16:28:07] !log dancy@deploy2002 Started scap: testing T357402 [16:28:12] T357402: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 [16:28:40] !log fabfur@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp[2031-2032].codfw.wmnet [16:28:42] !log fabfur@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[2031-2032].codfw.wmnet [16:30:51] !log fabfur@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=(cdn|ats-be) [16:30:57] !log fabfur@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=(cdn|ats-be) [16:32:56] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568564 (10cmooney) p:05Triage→03Medium [16:33:38] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568586 (10cmooney) [16:33:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568588 (10Fabfur) cp2031 and cp2032 are ok and repooled [16:36:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:36:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:38:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P57743 and previous config saved to /var/cache/conftool/dbconfig/20240222-163806-arnaudb.json [16:39:09] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568622 (10cmooney) All interfaces on 
asw-a-codfw are set to 'disabled' apart from the uplinks to ssw's, and no mac's learnt on SSW side so proceeding to delete those links... [16:42:12] !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: service=parsoid-php,name=kubernetes.* [16:43:04] !log dancy@deploy2002 sync-world aborted: testing T357402 (duration: 14m 57s) [16:43:10] T357402: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 [16:43:43] (03CR) 10Hnowlan: [C: 03+1] conftool: Remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1005728 (owner: 10Alexandros Kosiaris) [16:45:29] !log dancy@deploy2002 Started scap: testing T357402 again [16:49:59] (03PS1) 10Cathal Mooney: Remove definition/config for codfw ssw's ESI-LAG to asw-a-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1005799 (https://phabricator.wikimedia.org/T358244) [16:52:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:53:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T357189)', diff saved to https://phabricator.wikimedia.org/P57744 and previous config saved to /var/cache/conftool/dbconfig/20240222-165312-arnaudb.json [16:53:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [16:53:21] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:53:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:53:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [16:53:37] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:53:42] !log 
arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [16:53:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [16:54:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T357189)', diff saved to https://phabricator.wikimedia.org/P57745 and previous config saved to /var/cache/conftool/dbconfig/20240222-165401-arnaudb.json [16:54:27] !log dancy@deploy2002 Finished scap: testing T357402 again (duration: 08m 58s) [16:54:35] T357402: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 [16:56:04] !log disabling link from asw-a-codfw vc to ssw1-a1-codfw and ssw1-a8-codfw T355544 [16:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:09] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [16:56:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T357189)', diff saved to https://phabricator.wikimedia.org/P57746 and previous config saved to /var/cache/conftool/dbconfig/20240222-165619-arnaudb.json [16:56:42] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9568746 (10dancy) 05Open→03Resolved [16:56:50] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9568747 (10dancy) [16:57:12] (03PS3) 10Bking: rdf-streaming-updater: raise storage alert threshold [alerts] - 10https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) [16:57:48] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544" [16:58:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544" [16:58:47] (03CR) 10Cathal Mooney: [C: 03+2] Remove definition/config for codfw ssw's ESI-LAG to asw-a-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1005799 (https://phabricator.wikimedia.org/T358244) (owner: 10Cathal Mooney) [16:58:57] (03CR) 10DLynch: [C: 03+1] DiscussionTools: Remove no-op config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 (owner: 10Esanders) [17:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1700) [17:00:04] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:46] o/ [17:00:56] (03Merged) 10jenkins-bot: Remove definition/config for codfw ssw's ESI-LAG to asw-a-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1005799 (https://phabricator.wikimedia.org/T358244) (owner: 10Cathal Mooney) [17:00:58] dancy: hi! looking [17:01:07] looking as well [17:01:38] jhathaway: all yours if you'd like it :) [17:01:48] sure [17:02:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568765 (10cmooney) Ok I've removed the configuration for the ESI-LAG between the codfw spine switches and asw-a-codfw both sides now. DC-Ops you can... 
[17:02:18] (03CR) 10JHathaway: [C: 03+2] logstash_checker.py: Exit 10 if over error threshold [puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033) (owner: 10Ahmon Dancy) [17:02:39] dancy: any special steps you want, other than merging in? [17:02:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568799 (10cmooney) [17:03:10] Just merging is fine [17:03:13] Thanks! [17:03:20] great, done [17:05:33] !log disabling IPv6 RAs for private1-a-codfw vlan on codfw core routers T355544 [17:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:50] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [17:11:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P57747 and previous config saved to /var/cache/conftool/dbconfig/20240222-171125-arnaudb.json [17:11:32] (03CR) 10Brennen Bearnes: [C: 03+1] "Not very Pythonic to have well behaved exit codes instead of just spewing confusing exception traces all over the place." 
[puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033) (owner: 10Ahmon Dancy) [17:17:24] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108#9568895 (10MatthewVernon) a:03MatthewVernon [17:25:00] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 198, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:25:56] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:26:22] (03CR) 10Hnowlan: [C: 03+2] kubernetes: move all remaining eligible eqiad jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005776 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [17:26:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P57748 and previous config saved to /var/cache/conftool/dbconfig/20240222-172632-arnaudb.json [17:30:43] (03CR) 10Hnowlan: [C: 03+2] kubernetes: move all remaining eligible eqiad jobrunners to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005776 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [17:31:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:34:37] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108#9569041 (10jcrespo) 05In progress→03Resolved It took some 
time to confirm it live, because the number of new deleted files don't grow as fast as the "late... [17:35:39] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host testvm2002.codfw.wmnet with OS bullseye [17:36:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [17:38:05] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Cassandra, 10Data-Persistence, 10Sustainability (Incident Followup): Document best-practice for hinted-handoff - https://phabricator.wikimedia.org/T315517#9569072 (10Eevans) p:05Triage→03Low [17:39:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [17:39:28] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1458.eqiad.wmnet with OS bullseye [17:39:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1467.eqiad.wmnet with OS bullseye [17:40:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [17:41:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1468.eqiad.wmnet with OS bullseye [17:41:22] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569088 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1468.eqiad.wmnet with OS bullseye [17:41:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1483.eqiad.wmnet with OS bullseye [17:41:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS bullseye [17:41:32] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1485.eqiad.wmnet with OS bullseye [17:41:34] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569090 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1483.eqiad.wmnet with OS bullseye
[17:41:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T357189)', diff saved to https://phabricator.wikimedia.org/P57749 and previous config saved to /var/cache/conftool/dbconfig/20240222-174138-arnaudb.json
[17:41:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[17:41:42] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1484.eqiad.wmnet with OS bullseye
[17:41:44] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1494.eqiad.wmnet with OS bullseye
[17:41:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[17:41:55] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:41:56] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1485.eqiad.wmnet with OS bullseye
[17:42:02] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569096 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1494.eqiad.wmnet with OS bullseye
[17:42:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[17:42:19] (CR) CDanis: [C: +2] jaeger: bump collector and query resources [deployment-charts] -
https://gerrit.wikimedia.org/r/1005794 (https://phabricator.wikimedia.org/T358152) (owner: Filippo Giunchedi)
[17:42:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[17:42:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[17:42:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[17:43:06] (Merged) jenkins-bot: jaeger: bump collector and query resources [deployment-charts] - https://gerrit.wikimedia.org/r/1005794 (https://phabricator.wikimedia.org/T358152) (owner: Filippo Giunchedi)
[17:43:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[17:43:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[17:43:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[17:43:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T357189)', diff saved to https://phabricator.wikimedia.org/P57750 and previous config saved to /var/cache/conftool/dbconfig/20240222-174328-arnaudb.json
[17:43:47] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:43:49] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569109 (cmooney) p:Triage→Low
[17:43:50] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:43:58] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:44:02] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE
helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:44:18] SRE, Cassandra, Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9569125 (Eevans) p:Triage→Medium
[17:44:26] (CR) BCornwall: "Looks good to me, see inline." [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[17:44:41] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569128 (cmooney)
[17:44:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T357189)', diff saved to https://phabricator.wikimedia.org/P57751 and previous config saved to /var/cache/conftool/dbconfig/20240222-174449-arnaudb.json
[17:44:50] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:45:10] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[17:45:16] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:46:05] ^ me, not an actual problem.
acked
[17:48:38] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2384.codfw.wmnet with OS bullseye
[17:52:08] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1458.eqiad.wmnet with reason: host reimage
[17:52:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1467.eqiad.wmnet with reason: host reimage
[17:54:02] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1468.eqiad.wmnet with reason: host reimage
[17:54:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage
[17:54:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage
[17:54:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1458.eqiad.wmnet with reason: host reimage
[17:54:48] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage
[17:54:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1494.eqiad.wmnet with reason: host reimage
[17:57:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage
[17:59:09] (PS1) BryanDavis: developer-portal: Bump container to 2024-02-22-164056-production [deployment-charts] - https://gerrit.wikimedia.org/r/1005809 (https://phabricator.wikimedia.org/T280500)
[17:59:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1468.eqiad.wmnet with reason: host reimage
[17:59:56] !log arnaudb@cumin1002
dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P57752 and previous config saved to /var/cache/conftool/dbconfig/20240222-175956-arnaudb.json
[18:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1800).
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1800)
[18:00:44] (CR) BryanDavis: [C: +2] developer-portal: Bump container to 2024-02-22-164056-production [deployment-charts] - https://gerrit.wikimedia.org/r/1005809 (https://phabricator.wikimedia.org/T280500) (owner: BryanDavis)
[18:01:39] (Merged) jenkins-bot: developer-portal: Bump container to 2024-02-22-164056-production [deployment-charts] - https://gerrit.wikimedia.org/r/1005809 (https://phabricator.wikimedia.org/T280500) (owner: BryanDavis)
[18:01:49] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569247 (cmooney) So I'm realising the RAs are how the LVS is determining the attached v6 subnet and creating the auto-assigned eui-64 addresses on each vlan interface....
[18:01:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1494.eqiad.wmnet with reason: host reimage
[18:02:45] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:02:47] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:03:07] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:03:11] !log hnowlan@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host mw2384.codfw.wmnet with OS bullseye
[18:03:21] SRE, Continuous-Integration-Infrastructure, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9569250 (MoritzMuehlenhoff) What do you want to use as the host name, something like zuul1001?
[18:03:25] (SystemdUnitFailed) firing: ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:03:38] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:03:43] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2384.codfw.wmnet with OS bullseye
[18:04:04] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:04:11] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:04:34] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:04:49] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage
[18:07:49] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1467.eqiad.wmnet with reason: host reimage
[18:11:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage
[18:12:01] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569280 (cmooney) >>! In T358260#9569247, @cmooney wrote: > I notice there is a //"net.ipv6.conf..accept_ra_defrtr"// which from what I can tell will not add a...
[18:12:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1458.eqiad.wmnet with OS bullseye
[18:13:47] (PS17) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[18:14:29] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#9569312 (Volans) I've tested that the cookbook works fine with the existing cached firmwares on the cumin...
[18:14:49] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1483.eqiad.wmnet with OS bullseye
[18:14:55] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[18:15:02] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569314 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1483.eqiad.wmnet with OS bullseye completed: - mw1483 (**PASS**)...
[18:15:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P57753 and previous config saved to /var/cache/conftool/dbconfig/20240222-181502-arnaudb.json
[18:17:01] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1468.eqiad.wmnet with OS bullseye
[18:17:12] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1468.eqiad.wmnet with OS bullseye completed: - mw1468 (**PASS**)...
[18:18:57] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2384.codfw.wmnet with reason: host reimage
[18:19:50] SRE, Infrastructure-Foundations, netops: Do we need to generate aggregates for LVS service IP ranges? - https://phabricator.wikimedia.org/T350354#9569320 (cmooney) Open→Resolved a:cmooney >>! In T350354#9312533, @BBlack wrote: > I don't suspect it serves any real purpose at present, unles...
[18:21:03] PROBLEM - Check whether ferm is active by checking the default input chain on mw1457 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:21:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2384.codfw.wmnet with reason: host reimage
[18:22:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1484.eqiad.wmnet with OS bullseye
[18:22:26] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw
[18:22:30] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1484.eqiad.wmnet with OS bullseye completed: - mw1484 (**PASS**)...
[18:22:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw
[18:24:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1494.eqiad.wmnet with OS bullseye
[18:24:56] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569356 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1494.eqiad.wmnet with OS bullseye completed: - mw1494 (**WARN**)...
[18:25:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1467.eqiad.wmnet with OS bullseye
[18:28:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1485.eqiad.wmnet with OS bullseye
[18:28:58] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9569364 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1485.eqiad.wmnet with OS bullseye completed: - mw1485 (**PASS**)...
[18:30:03] (PS5) Cathal Mooney: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220)
[18:30:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T357189)', diff saved to https://phabricator.wikimedia.org/P57755 and previous config saved to /var/cache/conftool/dbconfig/20240222-183009-arnaudb.json
[18:30:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:30:20] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:30:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[18:30:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T357189)', diff saved to https://phabricator.wikimedia.org/P57756 and previous config saved to /var/cache/conftool/dbconfig/20240222-183030-arnaudb.json
[18:31:48] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2385.codfw.wmnet with OS bullseye
[18:32:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T357189)', diff saved to https://phabricator.wikimedia.org/P57757 and previous config saved to
/var/cache/conftool/dbconfig/20240222-183251-arnaudb.json
[18:40:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:40:56] (PS1) Ssingh: Revert "conftool: introduce schema and host file for dnsboxes" [puppet] - https://gerrit.wikimedia.org/r/1005693
[18:42:39] (CR) Ssingh: "After some more discussion, bblack and I have decided to revert this custom schema. This schema was necessitated mostly by our requirement" [puppet] - https://gerrit.wikimedia.org/r/1005693 (owner: Ssingh)
[18:44:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2384.codfw.wmnet with OS bullseye
[18:45:18] (CR) Ayounsi: [C: +2] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[18:46:53] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2385.codfw.wmnet with reason: host reimage
[18:47:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P57758 and previous config saved to /var/cache/conftool/dbconfig/20240222-184757-arnaudb.json
[18:49:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2385.codfw.wmnet with reason: host reimage
[18:50:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2057-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress -
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:52:17] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569458 (cmooney)
[18:53:04] (PS1) Ssingh: Revert "tests: add schema for dnsbox" [software/conftool] - https://gerrit.wikimedia.org/r/1005694
[18:53:26] (Merged) jenkins-bot: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[18:56:08] (CR) CI reject: [V: -1] Revert "tests: add schema for dnsbox" [software/conftool] - https://gerrit.wikimedia.org/r/1005694 (owner: Ssingh)
[18:56:59] (CR) Ssingh: "13:55:06 ERROR: InvocationError for command /src/.tox/py38-style/bin/black --config black.toml --check --diff . (exited with code 1)" [software/conftool] - https://gerrit.wikimedia.org/r/1005694 (owner: Ssingh)
[18:59:23] (PS1) CDanis: WIP [deployment-charts] - https://gerrit.wikimedia.org/r/1005819
[18:59:28] (PS1) Jdlrobson: Change font-size "Small" label to "Standard" [extensions/MobileFrontend] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1005695 (https://phabricator.wikimedia.org/T358074)
[19:00:04] jeena and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1900).
[19:00:29] o/
[19:01:11] o/
[19:03:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P57759 and previous config saved to /var/cache/conftool/dbconfig/20240222-190304-arnaudb.json
[19:04:03] SRE, Traffic: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9569509 (cmooney) FWIW this was the test I ran on one of our bookworm hosts. Starting with primary interface down, and vlan interface which is built on it also down, plu...
[19:05:40] (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[19:05:55] ^^ looking into these Elastic alerts
[19:06:02] (PS1) TrainBranchBot: group2 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005820 (https://phabricator.wikimedia.org/T354437)
[19:06:04] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005820 (https://phabricator.wikimedia.org/T354437) (owner: TrainBranchBot)
[19:06:40] (PS2) CDanis: jaeger: also give cpu res/limit to oauth2-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1005819 (https://phabricator.wikimedia.org/T358152)
[19:06:58] (CR) CDanis: [C: +2] jaeger: also give cpu res/limit to oauth2-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1005819 (https://phabricator.wikimedia.org/T358152) (owner: CDanis)
[19:07:03] (Merged) jenkins-bot: group2 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005820 (https://phabricator.wikimedia.org/T354437)
(owner: TrainBranchBot)
[19:07:54] (Merged) jenkins-bot: jaeger: also give cpu res/limit to oauth2-proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1005819 (https://phabricator.wikimedia.org/T358152) (owner: CDanis)
[19:14:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2385.codfw.wmnet with OS bullseye
[19:15:16] (PS1) CDanis: jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1005822
[19:15:39] (CirrusSearchNodeIndexingNotIncreasing) resolved: (2) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[19:16:07] (CR) CI reject: [V: -1] jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1005822 (owner: CDanis)
[19:17:45] (PS2) CDanis: jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1005822
[19:17:58] (PS1) Dbrant: Add verbiage for Account Vanishing contact page.
[extensions/WikimediaMessages] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1005696 (https://phabricator.wikimedia.org/T343536)
[19:18:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T357189)', diff saved to https://phabricator.wikimedia.org/P57760 and previous config saved to /var/cache/conftool/dbconfig/20240222-191810-arnaudb.json
[19:18:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[19:18:18] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:18:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[19:18:32] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.19 refs T354437
[19:18:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57761 and previous config saved to /var/cache/conftool/dbconfig/20240222-191834-arnaudb.json
[19:18:37] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437
[19:19:07] SRE, ops-codfw, serviceops: Issues reimaging servers in codfw - https://phabricator.wikimedia.org/T358001#9569554 (hnowlan) Open→Resolved a:hnowlan >>! In T358001#9563665, @Jhancock.wm wrote: > @hnowlan I've replaced the network cable on both of these. These are both connected to a 1G swi...
[19:19:25] (CR) CDanis: [C: +2] jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1005822 (owner: CDanis)
[19:20:13] (Merged) jenkins-bot: jaeger: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1005822 (owner: CDanis)
[19:20:38] (CR) Ebernhardson: [C: +2] rdf-streaming-updater: raise storage alert threshold [alerts] - https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) (owner: Bking)
[19:20:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57762 and previous config saved to /var/cache/conftool/dbconfig/20240222-192055-arnaudb.json
[19:21:41] (PS1) Dbrant: testwiki: Allow modifying email in account vanishing contact form. [mediawiki-config] - https://gerrit.wikimedia.org/r/1005824 (https://phabricator.wikimedia.org/T343536)
[19:21:44] (Merged) jenkins-bot: rdf-streaming-updater: raise storage alert threshold [alerts] - https://gerrit.wikimedia.org/r/1005791 (https://phabricator.wikimedia.org/T348685) (owner: Bking)
[19:22:56] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[19:23:39] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[19:27:29] !log robh@cumin2002 START - Cookbook sre.dns.netbox
[19:29:26] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup incorrect asset tags - robh@cumin2002"
[19:30:18] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup incorrect asset tags - robh@cumin2002"
[19:30:18] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:33:46] SRE, SRE-Access-Requests, LDAP-Access-Requests,
Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569601 (bvibber) >>! In T358044#9562210, @Bugreporter wrote: >>Too late now, but Phabricator accounts are easy enough to rename and...
[19:35:36] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569603 (bvibber) >>! In T358044#9562598, @MoritzMuehlenhoff wrote: > @bvibber Renaming the user name for SSH access will leave file...
[19:36:00] (PS4) Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421)
[19:36:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P57763 and previous config saved to /var/cache/conftool/dbconfig/20240222-193601-arnaudb.json
[19:40:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye
[19:40:39] (CR) CI reject: [V: -1] WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[19:45:11] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:46:14] jouncebot: nowandnext
[19:46:14] For the next 1 hour(s) and 13 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T1900)
[19:46:14] In 1 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T2100)
[19:46:53] (PS4) Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to
2024-01-18-182456 [deployment-charts] - https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596)
[19:47:56] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569624 (bvibber) Gerrit lets me connect but won't let me push updates to a patchset: ` % git review remote: remote: Processing ch...
[19:48:02] (CR) Jforrester: [C: +2] wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596) (owner: Jforrester)
[19:49:00] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596) (owner: Jforrester)
[19:49:34] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:50:04] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:50:10] (CR) Fabfur: [V: +1] haproxy: configure extended logging (preparatory for Benthos) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[19:50:35] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[19:51:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P57764 and previous config saved to /var/cache/conftool/dbconfig/20240222-195108-arnaudb.json
[19:52:03] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[19:52:18] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[19:52:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on
testvm2002.codfw.wmnet with reason: host reimage
[19:53:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:53:06] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569661 (taavi) Only members of https://gerrit.wikimedia.org/r/admin/groups/2021f25e7515187a81d51f8fe14dd6f25617cce0 can amend chang...
[19:53:26] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[19:53:37] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569668 (bvibber) Thx!
[19:53:43] (PS1) CDobbins: admin: update data.yaml for cdobbins [puppet] - https://gerrit.wikimedia.org/r/1005828
[19:53:59] (PS2) Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-18-182456 to 2024-02-12-155846 [deployment-charts] - https://gerrit.wikimedia.org/r/1002624 (https://phabricator.wikimedia.org/T296937)
[19:54:08] (CR) Jforrester: [C: +2] wikifunctions: Upgrade orchestrator from 2024-01-18-182456 to 2024-02-12-155846 [deployment-charts] - https://gerrit.wikimedia.org/r/1002624 (https://phabricator.wikimedia.org/T296937) (owner: Jforrester)
[19:55:07] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-01-18-182456 to 2024-02-12-155846 [deployment-charts] - https://gerrit.wikimedia.org/r/1002624 (https://phabricator.wikimedia.org/T296937) (owner: Jforrester)
[19:55:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[19:55:20] (CR) CI reject: [V: -1] admin: update data.yaml for cdobbins [puppet] - https://gerrit.wikimedia.org/r/1005828 (owner: CDobbins)
[19:56:10] !log jforrester@deploy2002
helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:56:49] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:57:27] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:58:45] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:58:53] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:00:37] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:00:46] 10SRE, 10Data-Platform-SRE: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9569715 (10Gehel) p:05Triage→03High [20:02:42] (03Abandoned) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005828 (owner: 10CDobbins) [20:03:25] (SystemdUnitFailed) firing: (2) ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:42] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-01-18-182630 to 2024-02-12-160222 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002625 (https://phabricator.wikimedia.org/T287978) [20:05:21] (03PS1) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005830 [20:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57765 and previous config saved to 
/var/cache/conftool/dbconfig/20240222-200614-arnaudb.json [20:06:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [20:06:17] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host testvm2002.codfw.wmnet with OS bullseye [20:06:22] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:06:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [20:06:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T357189)', diff saved to https://phabricator.wikimedia.org/P57766 and previous config saved to /var/cache/conftool/dbconfig/20240222-200636-arnaudb.json [20:07:44] (03CR) 10Ssingh: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1005830 (owner: 10CDobbins) [20:08:22] (03CR) 10CDobbins: [C: 03+2] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [20:08:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T357189)', diff saved to https://phabricator.wikimedia.org/P57767 and previous config saved to /var/cache/conftool/dbconfig/20240222-200858-arnaudb.json [20:12:03] (03CR) 10CDobbins: [C: 03+2] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005830 (owner: 10CDobbins) [20:17:48] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9569913 (10darthmon_wmde) [20:19:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9569914 (10Jclark-ctr) a:03Jclark-ctr [20:19:58] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9569915 
(10darthmon_wmde) 05Invalid→03Open I am very sorry - this ticket got out of my sight and I completely forgot about it. Could we pick it up anew, please? I just added... [20:21:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw1494 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:23:25] (SystemdUnitFailed) firing: (3) ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P57768 and previous config saved to /var/cache/conftool/dbconfig/20240222-202404-arnaudb.json [20:35:43] (03PS5) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [20:36:31] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Phabricator, 10Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9569964 (10Peachey88) >>! In T358044#9569601, @bvibber wrote: > That's probably the way to go then, if it'll keep assignments intact w... 
[20:39:08] (03PS6) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [20:39:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P57769 and previous config saved to /var/cache/conftool/dbconfig/20240222-203911-arnaudb.json [20:41:06] (03PS7) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [20:41:17] PROBLEM - Check whether ferm is active by checking the default input chain on mw2384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:44:31] 10SRE-swift-storage, 10MediaWiki-Uploading, 10MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), 10User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9569981 (10Bawolff) >>! In T200820#956... 
[20:44:41] (03PS8) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [20:45:39] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye [20:49:06] (03CR) 10CI reject: [V: 04-1] WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [20:53:25] (SystemdUnitFailed) firing: (4) ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T357189)', diff saved to https://phabricator.wikimedia.org/P57770 and previous config saved to /var/cache/conftool/dbconfig/20240222-205417-arnaudb.json [20:54:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:54:24] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:54:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:54:38] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9570020 (10thcipriani) >>! In T358117#9566949, @Clement_Goubert wrote: > We've talked this over, and while doing swagger checks mad... 
[20:54:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T357189)', diff saved to https://phabricator.wikimedia.org/P57771 and previous config saved to /var/cache/conftool/dbconfig/20240222-205440-arnaudb.json [20:55:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9570023 (10VRiley-WMF) [20:57:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T357189)', diff saved to https://phabricator.wikimedia.org/P57772 and previous config saved to /var/cache/conftool/dbconfig/20240222-205701-arnaudb.json [20:57:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240222T2100). nyaa~ [21:00:05] jan_drewniak, dbrant, and bawolff: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] \o/ [21:00:33] o/ [21:00:42] o/ [21:01:10] i can deploy [21:01:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [21:02:10] hi jan_drewniak :) i'll start with yours [21:02:29] hi cjming! 
thanks [21:02:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005695 (https://phabricator.wikimedia.org/T358074) (owner: 10Jdlrobson) [21:06:28] 10SRE, 10Data-Platform-SRE: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9570086 (10Dwisehaupt) We are tracking this from the fr-tech side in T358043. No impact on your work, just adding for full knowledge. [21:06:36] hi dbrant :) i'll get your backport going too bec CI [21:06:59] thx cjming [21:07:02] (03CR) 10Clare Ming: [C: 03+2] Add verbiage for Account Vanishing contact page. [extensions/WikimediaMessages] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005696 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:07:49] bawolff: can you prep your backport patch and add it to the cal? [21:08:24] oh right, sorry its been a super long time since I've done this. Just a moment [21:08:46] np! [21:09:35] (03PS1) 10Brian Wolff: Improve chunked upload jobs and abort assemble job if already in progress [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005698 (https://phabricator.wikimedia.org/T200820) [21:10:18] PROBLEM - Check whether ferm is active by checking the default input chain on mw2385 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:10:27] cjming: calendar updated [21:10:36] ty [21:11:00] * bawolff used to just doing config patches. 
[21:12:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P57773 and previous config saved to /var/cache/conftool/dbconfig/20240222-211208-arnaudb.json [21:12:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [21:17:48] (03PS9) 10Cathal Mooney: Adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [21:21:25] (03Merged) 10jenkins-bot: Change font-size "Small" label to "Standard" [extensions/MobileFrontend] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005695 (https://phabricator.wikimedia.org/T358074) (owner: 10Jdlrobson) [21:21:38] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1005695|Change font-size "Small" label to "Standard" (T358074)]] [21:21:47] T358074: Mobile labels are incorrect for font size - https://phabricator.wikimedia.org/T358074 [21:22:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9570125 (10VRiley-WMF) [21:27:06] (03Merged) 10jenkins-bot: Add verbiage for Account Vanishing contact page. 
[extensions/WikimediaMessages] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005696 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [21:27:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P57774 and previous config saved to /var/cache/conftool/dbconfig/20240222-212715-arnaudb.json [21:27:50] ahoy - if any SREs are around -- maybe i'm being impatient but i don't recall in recent memory getting changes out to test servers taking so long -- seems stuck on "K8s images build/push output redirected to /home/cjming/scap-image-build-and-push-log" in my terminal [21:28:06] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) (owner: 10Filippo Giunchedi) [21:28:40] doh - nvm - just started going again [21:29:16] but it does seem pokey fwiw [21:29:21] cjming: backporting i18n changes is always super slow [21:29:39] ah - gtk - thx [21:29:47] (apologies) [21:29:55] lol - nw [21:35:51] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-02-12-155846 to 2024-02-22-165335 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005843 (https://phabricator.wikimedia.org/T335695) [21:35:54] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1005695|Change font-size "Small" label to "Standard" (T358074)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:35:57] jan_drewniak: up on mwdebug - lmk when to sync [21:36:06] T358074: Mobile labels are incorrect for font size - https://phabricator.wikimedia.org/T358074 [21:38:06] jan_drewniak: shall i sync? 
[21:39:11] hey cjming looks good to sync [21:39:25] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [21:42:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T357189)', diff saved to https://phabricator.wikimedia.org/P57775 and previous config saved to /var/cache/conftool/dbconfig/20240222-214221-arnaudb.json [21:42:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:42:28] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:42:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:42:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:43:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:43:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T357189)', diff saved to https://phabricator.wikimedia.org/P57776 and previous config saved to /var/cache/conftool/dbconfig/20240222-214310-arnaudb.json [21:47:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T357189)', diff saved to https://phabricator.wikimedia.org/P57777 and previous config saved to /var/cache/conftool/dbconfig/20240222-214732-arnaudb.json [21:47:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:48:38] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:46] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1005695|Change 
font-size "Small" label to "Standard" (T358074)]] (duration: 29m 07s) [21:50:51] T358074: Mobile labels are incorrect for font size - https://phabricator.wikimedia.org/T358074 [21:51:05] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1005696|Add verbiage for Account Vanishing contact page. (T343536)]] [21:51:11] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [21:51:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9570268 (10VRiley-WMF) logging-hd1001 Rack A7 U 32 logging-hd1002 Rack B7 U 21 logging-hd1003 Rack D7 U26 [21:51:28] (03CR) 10Clare Ming: [C: 03+2] Improve chunked upload jobs and abort assemble job if already in progress [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005698 (https://phabricator.wikimedia.org/T200820) (owner: 10Brian Wolff) [21:52:06] jan_drewniak: should be live! [21:53:09] awesome, thanks! [21:53:27] yw! [21:54:15] dbrant: getting your 1st patch out to test servers - just waiting like before - your next patch should go quick [21:54:50] bawolff: went ahead and +2'd your backport in the meantime [21:56:12] 👍 [22:02:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P57778 and previous config saved to /var/cache/conftool/dbconfig/20240222-220238-arnaudb.json [22:05:44] !log cjming@deploy2002 dbrant and cjming: Backport for [[gerrit:1005696|Add verbiage for Account Vanishing contact page. (T343536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:05:48] dbrant: wanna test? lmk if i should sync [22:06:02] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [22:06:03] cjming: all good! 
[22:06:11] cool - syncing [22:06:14] !log cjming@deploy2002 dbrant and cjming: Continuing with sync [22:09:25] (03Merged) 10jenkins-bot: Improve chunked upload jobs and abort assemble job if already in progress [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005698 (https://phabricator.wikimedia.org/T200820) (owner: 10Brian Wolff) [22:16:25] seeing it live. [22:17:18] cool ! just waiting for php restarts [22:17:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P57779 and previous config saved to /var/cache/conftool/dbconfig/20240222-221745-arnaudb.json [22:18:32] thanks for your patience dbrant, bawolff - we're going over but since it's thurs, i'll finish the queue - the rest should go quick [22:18:52] ty :) [22:18:53] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1005696|Add verbiage for Account Vanishing contact page. (T343536)]] (duration: 27m 47s) [22:18:59] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [22:19:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005824 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [22:19:55] (03Merged) 10jenkins-bot: testwiki: Allow modifying email in account vanishing contact form. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005824 (https://phabricator.wikimedia.org/T343536) (owner: 10Dbrant) [22:20:21] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1005824|testwiki: Allow modifying email in account vanishing contact form. (T343536)]] [22:21:47] !log cjming@deploy2002 cjming and dbrant: Backport for [[gerrit:1005824|testwiki: Allow modifying email in account vanishing contact form. 
(T343536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:21:53] dbrant: 1st patch should be live everywhere, 2nd patch on test servers if you want to check [22:22:11] cjming: looks good [22:22:16] !log cjming@deploy2002 cjming and dbrant: Continuing with sync [22:30:19] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1005824|testwiki: Allow modifying email in account vanishing contact form. (T343536)]] (duration: 09m 58s) [22:30:30] T343536: [M] Create v1 of Special:Contact page for account vanish requests - https://phabricator.wikimedia.org/T343536 [22:30:38] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]] [22:30:44] T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820 [22:30:51] dbrant: 2nd patch should be live! [22:31:14] confirmed. thanks so much cjming [22:31:22] yw! [22:31:50] bawolff: i'm assuming your patch isn't really testable - should i just go ahead and sync? 
[22:32:02] !log cjming@deploy2002 bawolff and cjming: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:32:06] correct, it mostly applies to the job queue [22:32:14] cool - syncing then [22:32:16] !log cjming@deploy2002 bawolff and cjming: Continuing with sync [22:32:18] only after you upload a multi-GB file [22:32:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T357189)', diff saved to https://phabricator.wikimedia.org/P57780 and previous config saved to /var/cache/conftool/dbconfig/20240222-223251-arnaudb.json [22:32:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:32:59] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:33:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:33:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57781 and previous config saved to /var/cache/conftool/dbconfig/20240222-223314-arnaudb.json [22:35:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57782 and previous config saved to /var/cache/conftool/dbconfig/20240222-223536-arnaudb.json [22:36:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9570449 (10VRiley-WMF) es1035 WMF10710 FPZGC14 E 5 U 18 CableID 20220092 es1036 WMF10711 DPZGC14 E 6 U 18 CableID 20220057 es1037 WMF10712 CPZGC14 E 7 U 18 CableID 20220096 es1... 
[22:40:24] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]] (duration: 09m 46s) [22:40:27] bawolff: should be live! [22:40:30] T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820 [22:40:45] ty [22:40:58] yw [22:41:00] !log end of UTC late backport window [22:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P57783 and previous config saved to /var/cache/conftool/dbconfig/20240222-225042-arnaudb.json [23:01:13] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1035-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005591 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [23:05:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P57784 and previous config saved to /var/cache/conftool/dbconfig/20240222-230549-arnaudb.json [23:10:08] 10SRE-OnFire, 10Incident Tooling: introducing corto internal incident response workflow automation - https://phabricator.wikimedia.org/T356790#9570525 (10jhathaway) [23:20:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57785 and previous config saved to /var/cache/conftool/dbconfig/20240222-232056-arnaudb.json [23:20:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:21:02] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:21:09] 10SRE-swift-storage, 10Commons, 10UploadWizard: Incomplete files uploaded (10 MB 
interruption) - https://phabricator.wikimedia.org/T350917#9570554 (10Bawolff) >>! In T350917#9358944, @MatthewVernon wrote: > Picking a recent failure: > ` > mvernon@cumin1001:~$ sudo cumin -x --force --no-progress --no-color -o... [23:21:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:21:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57786 and previous config saved to /var/cache/conftool/dbconfig/20240222-232118-arnaudb.json [23:21:20] (03PS1) 10Tim Starling: OCR: Add HTTP proxy config [extensions/Wikisource] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005700 (https://phabricator.wikimedia.org/T357857) [23:21:51] (03CR) 10Tim Starling: [C: 03+2] OCR: Add HTTP proxy config [extensions/Wikisource] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005700 (https://phabricator.wikimedia.org/T357857) (owner: 10Tim Starling) [23:22:21] (03PS2) 10Tim Starling: CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:22:41] (03CR) 10Tim Starling: [C: 03+2] CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:23:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57787 and previous config saved to /var/cache/conftool/dbconfig/20240222-232338-arnaudb.json [23:27:51] (03PS1) 10Zabe: block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005701 (https://phabricator.wikimedia.org/T358208) [23:27:58] jouncebot: nowandnext [23:27:58] No deployments scheduled for the next 7 hour(s) 
and 32 minute(s) [23:27:58] In 7 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240223T0700) [23:28:05] (03CR) 10Zabe: [C: 03+2] block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005701 (https://phabricator.wikimedia.org/T358208) (owner: 10Zabe) [23:30:37] (03CR) 10Tim Starling: [C: 03+2] "..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:32:05] TimStarling: could you ping me when you are done with backporting? [23:34:31] PROBLEM - cassandra-a SSL 10.64.0.130:7000 on restbase1035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [23:35:32] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1035.eqiad.wmnet with reason: Bootstrapping — T354560 [23:35:37] yes [23:35:38] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [23:35:45] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1035.eqiad.wmnet with reason: Bootstrapping — T354560 [23:36:28] CI taking ages -- apparently it needs to run extension tests from 53 extensions in order to merge this change [23:38:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P57788 and previous config saved to /var/cache/conftool/dbconfig/20240222-233845-arnaudb.json [23:39:18] not sure why gate checks didn't run on the config patch but the main test build looks the same so I'm probably going to hit the submit button [23:40:27] (03Merged) 10jenkins-bot: OCR: Add HTTP proxy config [extensions/Wikisource] (wmf/1.42.0-wmf.19) - 
10https://gerrit.wikimedia.org/r/1005700 (https://phabricator.wikimedia.org/T357857) (owner: 10Tim Starling) [23:43:16] (03CR) 10Tim Starling: [C: 03+2] InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:46:05] (03Merged) 10jenkins-bot: block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005701 (https://phabricator.wikimedia.org/T358208) (owner: 10Zabe) [23:47:07] (03CR) 10Tim Starling: InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:47:12] (03PS2) 10Tim Starling: InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:48:32] (03CR) 10Tim Starling: [C: 03+2] InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:49:18] (03Merged) 10jenkins-bot: InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [23:49:55] !log tstarling@deploy2002 Started scap: (no justification provided) [23:53:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P57789 and previous config saved to /var/cache/conftool/dbconfig/20240222-235351-arnaudb.json [23:57:18] 10SRE-swift-storage, 10Commons, 10UploadWizard: Incomplete files uploaded (10 MB interruption) - https://phabricator.wikimedia.org/T350917#9570643 (10Bawolff) 
More recent example is File:Delft_Van_Miereveltlaan_7.jpg (aka 1aqdj7jclxmc.afcsnk.1553787.jpg aka 1aqdj6vvp0a8.a3pwzj.1553787.jpg ) which is cut off... [23:59:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:59:36] !log tstarling@deploy2002 Finished scap: (no justification provided) (duration: 09m 40s) [23:59:46] 10SRE-swift-storage, 10Commons, 10UploadWizard: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9570654 (10Bawolff) [23:59:53] ^zabe