[00:01:36] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:01:48] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1035870 (owner: 10TrainBranchBot) [00:02:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:02:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:02:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P63369 and previous config saved to /var/cache/conftool/dbconfig/20240528-000213-marostegui.json [00:05:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:05:12] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:05:40] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:05:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T364069)', diff saved to https://phabricator.wikimedia.org/P63370 and previous config saved to /var/cache/conftool/dbconfig/20240528-000549-marostegui.json [00:05:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [00:05:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet 
with reason: Maintenance [00:05:56] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T364069)', diff saved to https://phabricator.wikimedia.org/P63371 and previous config saved to /var/cache/conftool/dbconfig/20240528-000602-marostegui.json [00:08:07] FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:17:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P63372 and previous config saved to /var/cache/conftool/dbconfig/20240528-001721-marostegui.json [00:27:54] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [00:28:48] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [00:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63373 and previous config saved to 
/var/cache/conftool/dbconfig/20240528-003230-marostegui.json [00:32:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:32:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:32:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:32:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T364299)', diff saved to https://phabricator.wikimedia.org/P63374 and previous config saved to /var/cache/conftool/dbconfig/20240528-003255-marostegui.json [00:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364069)', diff saved to https://phabricator.wikimedia.org/P63375 and previous config saved to /var/cache/conftool/dbconfig/20240528-004559-marostegui.json [00:46:06] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:53:57] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9836312 (10Soda) [00:58:55] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [00:59:48] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [01:01:08] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P63376 and previous config saved to /var/cache/conftool/dbconfig/20240528-010107-marostegui.json [01:07:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.7 [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1035871 (https://phabricator.wikimedia.org/T361401) [01:07:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.7 [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1035871 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [01:09:42] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P63377 and previous config saved to /var/cache/conftool/dbconfig/20240528-011615-marostegui.json [01:27:24] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.7 [core] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1035871 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [01:31:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T364069)', diff saved to https://phabricator.wikimedia.org/P63378 and previous config saved to /var/cache/conftool/dbconfig/20240528-013123-marostegui.json [01:31:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance [01:31:29] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:31:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: 
Maintenance [01:56:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9836351 (10KFrancis) Hi all, I'm confirming the NDA has been signed. Please proceed with next steps. Thanks! [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0200) [02:06:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T364299)', diff saved to https://phabricator.wikimedia.org/P63379 and previous config saved to /var/cache/conftool/dbconfig/20240528-020647-marostegui.json [02:06:52] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:21:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P63380 and previous config saved to /var/cache/conftool/dbconfig/20240528-022155-marostegui.json [02:26:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [02:26:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [02:26:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T364069)', diff saved to https://phabricator.wikimedia.org/P63381 and previous config saved to /var/cache/conftool/dbconfig/20240528-022627-marostegui.json [02:26:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:36:48] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P63382 and previous config saved to /var/cache/conftool/dbconfig/20240528-023703-marostegui.json [02:50:23] SRE, Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9836371 (Chqaz) @Ladsgroup I do not know it. There likely be no governance issues. [02:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T364299)', diff saved to https://phabricator.wikimedia.org/P63383 and previous config saved to /var/cache/conftool/dbconfig/20240528-025211-marostegui.json [02:52:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [02:52:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:52:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [02:52:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T364299)', diff saved to https://phabricator.wikimedia.org/P63384 and previous config saved to /var/cache/conftool/dbconfig/20240528-025234-marostegui.json [02:56:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0300) [03:01:38] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.7 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1036339 (https://phabricator.wikimedia.org/T361401) [03:01:39] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036339 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [03:02:19] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036339 (https://phabricator.wikimedia.org/T361401) (owner: 10TrainBranchBot) [03:02:46] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.7 refs T361401 [03:02:51] T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401 [03:04:27] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364069)', diff saved to https://phabricator.wikimedia.org/P63385 and previous config saved to /var/cache/conftool/dbconfig/20240528-031327-marostegui.json [03:13:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P63386 and previous config saved to /var/cache/conftool/dbconfig/20240528-032835-marostegui.json [03:30:57] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold 
[1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [03:41:56] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [03:43:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P63387 and previous config saved to /var/cache/conftool/dbconfig/20240528-034344-marostegui.json [03:49:29] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036251 (owner: 10L10n-bot) [03:58:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T364069)', diff saved to https://phabricator.wikimedia.org/P63388 and previous config saved to /var/cache/conftool/dbconfig/20240528-035852-marostegui.json [03:58:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [03:58:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:59:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [03:59:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63389 and previous config saved to /var/cache/conftool/dbconfig/20240528-035915-marostegui.json [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0400) [04:02:43] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.7 refs T361401 (duration: 59m 56s) [04:02:48] T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401 [04:03:45] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.4 (duration: 03m 44s) [04:08:07] FIRING: KubernetesCalicoDown: wikikube-worker2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:08:36] (03PS1) 10Pppery: Export source strings and documentation again [phabricator/translations] (wmf/stable) - 
10https://gerrit.wikimedia.org/r/1036342 [04:19:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T364299)', diff saved to https://phabricator.wikimedia.org/P63390 and previous config saved to /var/cache/conftool/dbconfig/20240528-041937-marostegui.json [04:19:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:26:16] (03CR) 10Pppery: "(This happened because the previous patch was merged while translatewiki was in the process of exporting changes, which confused the syste" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [04:31:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:31:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:33:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:33:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:34:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P63391 and previous config saved to /var/cache/conftool/dbconfig/20240528-043446-marostegui.json [04:44:21] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036345 (https://phabricator.wikimedia.org/T349774) [04:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63392 and previous config saved to 
/var/cache/conftool/dbconfig/20240528-044428-marostegui.json [04:44:35] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:45:36] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (lists1004), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:46:39] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036345 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [04:47:40] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036345 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [04:48:19] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [04:48:40] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [04:48:41] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [04:49:29] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [04:49:30] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [04:49:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P63393 and previous config saved to /var/cache/conftool/dbconfig/20240528-044955-marostegui.json [04:50:01] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [04:56:40] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9836464 (10Marostegui) So far no issues in... 
[04:59:07] (03PS1) 10Stevemunene: Configure datahub-gms-next not to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) [04:59:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P63394 and previous config saved to /var/cache/conftool/dbconfig/20240528-045936-marostegui.json [05:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T364299)', diff saved to https://phabricator.wikimedia.org/P63395 and previous config saved to /var/cache/conftool/dbconfig/20240528-050504-marostegui.json [05:05:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [05:05:10] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:05:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [05:05:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63396 and previous config saved to /var/cache/conftool/dbconfig/20240528-050527-marostegui.json [05:06:00] (03PS1) 10Marostegui: db2178: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036347 [05:06:24] (03CR) 10Marostegui: [C:03+2] db2178: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1036347 (owner: 10Marostegui) [05:09:38] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1035872 (https://phabricator.wikimedia.org/T366038) [05:14:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P63397 and previous config saved to /var/cache/conftool/dbconfig/20240528-051444-marostegui.json 
[05:20:55] (PS1) Marostegui: db2202: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036348 [05:21:18] (CR) Marostegui: [C:+2] db2202: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036348 (owner: Marostegui) [05:24:15] (PS1) Abijeet Patro: Revert "Localisation updates from https://translatewiki.net." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036224 [05:29:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T364069)', diff saved to https://phabricator.wikimedia.org/P63398 and previous config saved to /var/cache/conftool/dbconfig/20240528-052952-marostegui.json [05:29:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:30:16] (CR) Abijeet Patro: [V:+1 C:+1] Revert "Localisation updates from https://translatewiki.net." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036224 (owner: Abijeet Patro) [05:39:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:45:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:47:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:52:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:55:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [05:55:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0600). [06:27:55] (03CR) 10CI reject: [V:04-1] Revert "Localisation updates from https://translatewiki.net." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036224 (owner: 10Abijeet Patro) [06:28:35] (03CR) 10Abijeet Patro: [V:03+2 C:03+1] Revert "Localisation updates from https://translatewiki.net." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036224 (owner: 10Abijeet Patro) [06:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63399 and previous config saved to /var/cache/conftool/dbconfig/20240528-063417-marostegui.json [06:34:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:41:02] (03CR) 10DCausse: wdqs.data-reload: support HDFS as a source (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [06:41:39] (03PS19) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [06:48:59] (03PS1) 10Marostegui: orchestrator: Remove kormat from powerusers [puppet] - 10https://gerrit.wikimedia.org/r/1036482 [06:49:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P63400 and previous config saved to /var/cache/conftool/dbconfig/20240528-064926-marostegui.json [06:50:14] (03CR) 10Marostegui: [C:03+2] orchestrator: Remove kormat from powerusers [puppet] - 10https://gerrit.wikimedia.org/r/1036482 (owner: 10Marostegui) [06:56:41] (03CR) 10Muehlenhoff: [C:03+2] Add new access group to grant root on the wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1036270 (owner: 10Muehlenhoff) [06:59:17] (03PS1) 10Ayounsi: Revert "preseed: Add kafka-main to first time seeding" [puppet] - 10https://gerrit.wikimedia.org/r/1036375 (https://phabricator.wikimedia.org/T363212) [06:59:52] (03PS2) 10Ayounsi: Revert "preseed: Add kafka-main to first time seeding" [puppet] - 10https://gerrit.wikimedia.org/r/1036375 (https://phabricator.wikimedia.org/T363212) [07:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:37] * kart_ is here [07:00:47] I'll start deploy [07:01:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036289 (https://phabricator.wikimedia.org/T366003) (owner: 10KartikMistry) [07:02:09] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036375 (https://phabricator.wikimedia.org/T363212) (owner: 10Ayounsi) [07:02:32] (03Merged) 10jenkins-bot: Section Translation: Enable in newly created Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036289 (https://phabricator.wikimedia.org/T366003) (owner: 10KartikMistry) [07:03:17] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1036289|Section Translation: Enable in newly created Wikipedias (T366003)]] [07:03:21] T366003: Complete the enablement of newly created wikipedias - https://phabricator.wikimedia.org/T366003 [07:04:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P63401 and previous config saved to /var/cache/conftool/dbconfig/20240528-070434-marostegui.json [07:06:16] !log kartik@deploy1002 kartik: Backport for [[gerrit:1036289|Section Translation: Enable in newly created Wikipedias (T366003)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:07:04] (03CR) 10Ayounsi: [C:03+2] Revert "preseed: Add kafka-main to first time seeding" [puppet] - 10https://gerrit.wikimedia.org/r/1036375 (https://phabricator.wikimedia.org/T363212) (owner: 10Ayounsi) [07:07:32] !log kartik@deploy1002 kartik: 
Continuing with sync [07:16:56] (03CR) 10Jelto: [C:03+2] external_clouds_vendors: add Vultr cloud [puppet] - 10https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [07:16:56] (03CR) 10Brouberol: Configure datahub-gms-next not to wait for upgrade before starting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [07:17:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bookworm [07:17:49] (03CR) 10Muehlenhoff: [C:03+2] aptrepo: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1036241 (owner: 10Muehlenhoff) [07:19:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63402 and previous config saved to /var/cache/conftool/dbconfig/20240528-071942-marostegui.json [07:19:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [07:19:48] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:19:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [07:20:02] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9836573 (10ABran-WMF) [07:20:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T364299)', diff saved to https://phabricator.wikimedia.org/P63403 and previous config saved to /var/cache/conftool/dbconfig/20240528-072006-marostegui.json [07:20:15] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9836574 
(ABran-WMF) [07:22:01] (PS2) Stevemunene: Configure datahub-gms-next not to wait for upgrade before starting [deployment-charts] - https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) [07:23:08] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1036289|Section Translation: Enable in newly created Wikipedias (T366003)]] (duration: 19m 51s) [07:23:16] T366003: Complete the enablement of newly created wikipedias - https://phabricator.wikimedia.org/T366003 [07:23:42] I'm done with my config deployment. [07:23:43] (CR) Jelto: [C:+2] "manual run of dump_cloud_ip_ranges.service looks good and 181 ip ranges were added." [puppet] - https://gerrit.wikimedia.org/r/1036201 (https://phabricator.wikimedia.org/T303534) (owner: Jelto) [07:23:52] (PS1) Marostegui: Revert "db1243: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1036376 [07:24:03] ops-eqiad, SRE, DBA, DC-Ops: Upgrade db1243 NICs firmware - https://phabricator.wikimedia.org/T365963#9836578 (Marostegui) Open→Declined Not needed anymore [07:24:11] (CR) Muehlenhoff: [V:+2 C:+2] wmf-laptop: Update changelog [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1036186 (owner: Muehlenhoff) [07:24:30] (CR) Stevemunene: Configure datahub-gms-next not to wait for upgrade before starting (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) (owner: Stevemunene) [07:24:38] SRE, Data-Persistence, DBA, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9836585 (ABran-WMF) [07:25:46] (CR) Muehlenhoff: [C:+2] kafka::mirror: Remove obsolete class parameter [puppet] - https://gerrit.wikimedia.org/r/1035329 (owner: Muehlenhoff) [07:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:28:21] (03CR) 10Brouberol: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [07:30:33] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [07:31:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage [07:32:12] PROBLEM - Categories update lag on wdqs1016 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:32:10.319573 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:32:12] PROBLEM - Categories update lag on wdqs1012 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:32:10.338118 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:32:12] PROBLEM - Categories update lag on wdqs1014 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:32:10.341301 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:32:12] PROBLEM - Categories update lag on wdqs1018 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:32:11.187218 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:32:12] PROBLEM - Categories update lag on wdqs1020 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:32:11.189614 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:34:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage [07:35:17] PROBLEM - Categories update lag on wdqs2013 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.204119 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:17] PROBLEM - Categories update lag on wdqs2015 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.213766 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:17] PROBLEM - Categories update lag on wdqs2025 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.220243 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:17] PROBLEM - Categories update lag on wdqs2019 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.217785 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:17] PROBLEM - Categories update lag on wdqs2017 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.224242 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:18] PROBLEM - Categories update lag on wdqs2021 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.233198 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:18] PROBLEM - Categories update lag on wdqs2011 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:35:16.233634 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:35:23] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm [07:35:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 445, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:36:54] (03CR) 10Stevemunene: [C:03+2] Configure datahub-gms-next not to wait for upgrade before starting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [07:37:38] (03Merged) 10jenkins-bot: Configure datahub-gms-next not to wait for upgrade before starting [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1036346 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [07:38:15] PROBLEM - Categories update lag on wdqs2007 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:38:14.323999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:38:20] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [07:40:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 523, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:41:15] PROBLEM - Categories update lag on wdqs2009 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:41:14.289159 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:41:56] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [07:42:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2203.codfw.wmnet [07:43:49] (03PS1) 10Muehlenhoff: Switch db2203 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036553 (https://phabricator.wikimedia.org/T349619) [07:44:35] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:46:36] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage [07:47:11] PROBLEM - Categories update lag on wdqs1017 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:10.301887 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:11] PROBLEM - Categories update lag on wdqs1011 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:10.313951 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:11] PROBLEM - Categories update lag on wdqs1015 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:10.317336 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:11] PROBLEM - Categories update lag on wdqs1013 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:10.387547 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:13] PROBLEM - Categories update lag on wdqs1019 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:11.813008 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:47:13] PROBLEM - Categories update lag on wdqs1021 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:47:11.860742 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:49:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage [07:50:12] PROBLEM - Categories update lag on wdqs2008 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:11.748122 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] PROBLEM - Categories update lag on wdqs2016 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.206062 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] PROBLEM - Categories update lag on wdqs2010 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.203159 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] PROBLEM - Categories update lag on wdqs2014 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.207987 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:14] PROBLEM - Categories update lag on wdqs2018 is 
CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.216290 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:15] PROBLEM - Categories update lag on wdqs2012 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.216386 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:15] PROBLEM - Categories update lag on wdqs2022 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.914091 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:15] PROBLEM - Categories update lag on wdqs2024 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:13.917198 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:50:16] PROBLEM - Categories update lag on wdqs2020 is CRITICAL: CRITICAL - Categories lag: 12 days, 2:50:14.056848 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [07:51:09] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc2049.codfw.wmnet with OS bookworm [07:51:48] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm [07:52:22] (03CR) 10Muehlenhoff: [C:03+2] Switch db2203 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036553 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:56:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1243.eqiad.wmnet with OS bookworm [08:00:38] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:01:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2203.codfw.wmnet [08:04:38] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:05:26] (03CR) 10Ayounsi: [C:03+2] sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [08:05:30] (03CR) 10Ayounsi: [C:03+2] sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [08:06:38] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:08:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1228 T364290', diff saved to https://phabricator.wikimedia.org/P63404 and previous config saved to /var/cache/conftool/dbconfig/20240528-080835-arnaudb.json [08:08:42] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [08:08:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1228.eqiad.wmnet with reason: reimage [08:09:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1228.eqiad.wmnet with reason: reimage [08:09:21] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [08:09:24] (03Merged) 10jenkins-bot: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [08:09:25] (03Merged) 10jenkins-bot: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [08:09:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2001.codfw.wmnet with OS bullseye [08:10:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS 
bookworm [08:12:25] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [08:22:03] !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2049.codfw.wmnet with OS bookworm [08:22:37] (03PS1) 10Muehlenhoff: profile::elasticsearch::cirrus: Remove obsolete http2 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1036556 [08:22:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage [08:23:05] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit patchset upgrade [08:23:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit patchset upgrade [08:23:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036556 (owner: 10Muehlenhoff) [08:23:33] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc2049.codfw.wmnet [08:23:33] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit patchset upgrade [08:23:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit patchset upgrade [08:23:47] (03PS1) 10Stevemunene: admin_ng: create datahub-next namespace tlsHostnames and extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036557 (https://phabricator.wikimedia.org/T361185) [08:25:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage [08:27:00] !log imported jenkins to 2.452.1 in component thirdparty/ci T366008 [08:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:05] T366008: Upgrade Jenkins 
instances to 2.452.1 - https://phabricator.wikimedia.org/T366008 [08:28:22] (03PS1) 10Marostegui: core_test.pp: Remove mariadb 10.2 and 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1036560 [08:28:34] I am going to upgrade Gerrit from 3.8.5 to 3.8.6 , it will be unavailable for a couple minutes T365328 [08:28:35] T365328: Upgrade from Gerrit 3.8.5 to 3.8.6 - https://phabricator.wikimedia.org/T365328 [08:29:29] (03CR) 10Hashar: [C:03+2] "Jelto has kindly put the two hosts in scheduled downtime. Let's go!" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036286 (https://phabricator.wikimedia.org/T365328) (owner: 10Hashar) [08:30:18] (03Merged) 10jenkins-bot: Upgrade to Gerrit v3.8.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1036286 (https://phabricator.wikimedia.org/T365328) (owner: 10Hashar) [08:33:09] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c93e47d]: Gerrit to v3.8.6 on gerrit2002 - T365328 [08:33:17] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c93e47d]: Gerrit to v3.8.6 on gerrit2002 - T365328 (duration: 00m 08s) [08:34:38] (03PS2) 10Marostegui: core_test.pp: Remove mariadb 10.2 and 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1036560 [08:35:27] (03CR) 10Marostegui: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036560 (owner: 10Marostegui) [08:37:11] !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [08:38:22] (03CR) 10Marostegui: [C:03+2] core_test.pp: Remove mariadb 10.2 and 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1036560 (owner: 10Marostegui) [08:39:26] marostegui: s8 is acting weird :((( [08:39:28] it is a bit long cause I am double checking the deployment instructions [08:39:37] Amir1: in which way [08:39:50] check orch [08:40:06] It is all fine in orch [08:40:26] (03PS1) 10Zabe: Initial configuration for dtpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036564 
(https://phabricator.wikimedia.org/T365220) [08:43:34] jouncebot: nowandnext [08:43:34] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [08:43:34] In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1000) [08:45:07] !log Upgraded gerrit-replica.wikimedia.org from Gerrit 3.8.5 to 3.8.6 # T365328 [08:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:12] T365328: Upgrade from Gerrit 3.8.5 to 3.8.6 - https://phabricator.wikimedia.org/T365328 [08:45:44] (03CR) 10Zabe: [C:03+2] Initial configuration for dtpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036564 (https://phabricator.wikimedia.org/T365220) (owner: 10Zabe) [08:45:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1228.eqiad.wmnet with OS bookworm [08:46:30] (03Merged) 10jenkins-bot: Initial configuration for dtpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036564 (https://phabricator.wikimedia.org/T365220) (owner: 10Zabe) [08:46:31] I am upgrading the primary Gerrit now [08:47:33] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c93e47d]: Gerrit to v3.8.6 on gerrit1003 - T365328 [08:47:38] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c93e47d]: Gerrit to v3.8.6 on gerrit1003 - T365328 (duration: 00m 05s) [08:48:02] autoCreateUser failed for dtp: Automatic account creation is not allowed. 
[08:48:02] lol [08:48:04] taavi ^ [08:48:13] oh no [08:48:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T364299)', diff saved to https://phabricator.wikimedia.org/P63406 and previous config saved to /var/cache/conftool/dbconfig/20240528-084820-marostegui.json [08:48:25] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:48:26] !log create Wikipedia Central Dusun # T365220 [08:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:30] T365220: Create Wikipedia Central Dusun - https://phabricator.wikimedia.org/T365220 [08:48:36] hm? [08:49:35] your patch for creating local accounts is not working :/ [08:49:41] (on wiki creation) [08:49:42] !log Upgraded gerrit.wikimedia.org from Gerrit 3.8.5 to 3.8.6 # T365328 [08:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:15] !log zabe@deploy1002 Started scap: T365220 [08:50:17] why does it say "for dtp"? [08:50:20] hashar: BTW... I'm not sure when it started but the "reload" button on gerrit CRs stopped working for me a while ago [08:50:22] Gerrit is back [08:50:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1228 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63407 and previous config saved to /var/cache/conftool/dbconfig/20240528-085032-arnaudb.json [08:50:46] vgutierrez: yeah that got broken at some point, I think in 3.7 but that is restored again in 3.9 I think [08:51:18] hmm [08:51:23] it's dtpwiki [08:51:33] hashar: so it will be eventually back here :) [08:51:36] vgutierrez: https://phabricator.wikimedia.org/T360550 , I haven't really rushed that one since I have heard little complains about [RELOAD] not working. 
The good news is I will upgrade Gerrit 3.9 very soonish [08:51:38] yeah [08:51:40] almost as if the params aren't passed properly to the createLocal script [08:51:40] hopefully next week [08:51:48] should https://gerrit.wikimedia.org/g/mediawiki/extensions/CentralAuth/+/8832200c8666ddb7ba7fdd4114cf0622d5e62540/maintenance/createLocalAccount.php#21 use getArg( 'username' ) instead? [08:52:31] vgutierrez: thank you for the report! :) [08:52:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1219.eqiad.wmnet with reason: reimage [08:52:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1219.eqiad.wmnet with reason: reimage [08:52:56] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mc2049.codfw.wmnet [08:53:26] PROBLEM - Memcached on mc2049 is CRITICAL: connect to address 10.192.32.81 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [08:53:34] zabe: ping me when you've finished adding wikis and I'll update wikistats [08:53:49] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc2049.codfw.wmnet [08:54:22] !log zabe@deploy1002 zabe: T365220 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:54:29] T365220: Create Wikipedia Central Dusun - https://phabricator.wikimedia.org/T365220 [08:55:27] !log zabe@deploy1002 zabe: Continuing with sync [08:55:30] suer [08:55:33] * sure [08:56:16] PROBLEM - MariaDB Replica Lag: s2 on db2125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:18] PROBLEM - MariaDB Replica Lag: s2 on db2148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:24] PROBLEM - MariaDB Replica Lag: s2 on db2138 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:24] PROBLEM - MariaDB Replica Lag: s2 on db2175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:35] Amir1: ^ [08:56:40] PROBLEM - MariaDB Replica Lag: s2 on db2204 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:43] on it [08:56:54] checking too [08:57:48] Replication was stopped in db2125 [08:57:51] I fixed that one [08:57:56] I guess it is the same for the others? [08:58:44] I thought it's missing grants on master but it seems to be there [08:58:59] !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [08:59:12] Amir1: db2125 seems to be fine and using the new user [08:59:15] so i don't think it was that [08:59:16] (03CR) 10Klausman: [C:03+1] ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: 10Ilias Sarantopoulos) [08:59:16] RECOVERY - MariaDB Replica Lag: s2 on db2125 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:59:17] let me fix the other hosts [08:59:25] It broke again I think [08:59:37] db2125 is fixed [08:59:38] Thanks [08:59:45] What is missing? 
[08:59:46] it caught up with the master [09:00:11] thanks [09:00:30] zabe: sorry about that, fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/1036569/ [09:01:00] *I think) [09:01:58] The recoveries will come soon [09:02:18] RECOVERY - MariaDB Replica Lag: s2 on db2148 is OK: OK slave_sql_lag Replication lag: 23.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:02:24] RECOVERY - MariaDB Replica Lag: s2 on db2138 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:03:22] !log jiji@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host mc2049.codfw.wmnet [09:03:24] RECOVERY - MariaDB Replica Lag: s2 on db2175 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:03:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P63408 and previous config saved to /var/cache/conftool/dbconfig/20240528-090328-marostegui.json [09:05:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1228 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63409 and previous config saved to /var/cache/conftool/dbconfig/20240528-090538-arnaudb.json [09:07:17] AAAAAH SEMI SYNC [09:07:22] AAAAAHHHH [09:07:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1219 T364290', diff saved to https://phabricator.wikimedia.org/P63410 and previous config saved to /var/cache/conftool/dbconfig/20240528-090724-arnaudb.json [09:07:29] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [09:07:36] PROBLEM - Disk space on backup1007 is CRITICAL: DISK CRITICAL - free space: /srv/objectstorage 6642824 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1007&var-datasource=eqiad+prometheus/ops [09:07:40] RECOVERY - MariaDB Replica Lag: s2 on db2204 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:08:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1219.eqiad.wmnet with OS bookworm [09:09:37] !log zabe@deploy1002 Finished scap: T365220 (duration: 19m 22s) [09:09:42] T365220: Create Wikipedia Central Dusun - https://phabricator.wikimedia.org/T365220 [09:10:34] (03PS1) 10Zabe: Stop writing to af_user(_text)/afh_user(_text) everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036573 (https://phabricator.wikimedia.org/T337920) [09:12:27] RhinosF1: ^ [09:12:50] zabe: is it just the one today? [09:13:02] yes [09:13:29] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=dtpwiki --cluster=all 2>&1 | tee /tmp/dtpwiki.UpdateSearchIndexConfig.log # T365220 [09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:08] !log jelto@cumin1002 START - Cookbook sre.hosts.remove-downtime for gerrit2002.wikimedia.org [09:14:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit2002.wikimedia.org [09:14:39] !log jelto@cumin1002 START - Cookbook sre.hosts.remove-downtime for gerrit1003.wikimedia.org [09:14:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit1003.wikimedia.org [09:14:56] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036586 [09:14:56] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036586 (owner: 10Zabe) [09:15:06] zabe: perfect, thanks [09:15:50] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1036586 (owner: 10Zabe) [09:16:03] (03CR) 10Zabe: [C:03+2] Stop writing to af_user(_text)/afh_user(_text) everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036573 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [09:16:12] (03PS1) 10Muehlenhoff: Send account check mails to SRE IF alias [puppet] - 10https://gerrit.wikimedia.org/r/1036574 [09:17:19] (03Merged) 10jenkins-bot: Stop writing to af_user(_text)/afh_user(_text) everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036573 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [09:17:56] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1036573|Stop writing to af_user(_text)/afh_user(_text) everywhere (T337920)]], [[gerrit:1036586|Update interwiki cache]] [09:18:01] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [09:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P63411 and previous config saved to /var/cache/conftool/dbconfig/20240528-091836-marostegui.json [09:18:57] (03PS1) 10Ladsgroup: mariadb: Use the new replication user on grants file [puppet] - 10https://gerrit.wikimedia.org/r/1036575 [09:20:36] !log zabe@deploy1002 zabe: Backport for [[gerrit:1036573|Stop writing to af_user(_text)/afh_user(_text) everywhere (T337920)]], [[gerrit:1036586|Update interwiki cache]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:20:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1228 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63412 and previous config saved to /var/cache/conftool/dbconfig/20240528-092046-arnaudb.json [09:21:21] !log zabe@deploy1002 zabe: Continuing with sync [09:21:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [09:22:35] 
(03CR) 10Marostegui: [C:03+1] mariadb: Use the new replication user on grants file [puppet] - 10https://gerrit.wikimedia.org/r/1036575 (owner: 10Ladsgroup) [09:24:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [09:24:49] (03PS1) 10Ladsgroup: cumin: Update mariadb replication user [puppet] - 10https://gerrit.wikimedia.org/r/1036579 [09:25:35] (03CR) 10Ladsgroup: [C:03+2] mariadb: Use the new replication user on grants file [puppet] - 10https://gerrit.wikimedia.org/r/1036575 (owner: 10Ladsgroup) [09:26:00] (03CR) 10Ladsgroup: [C:03+2] cumin: Update mariadb replication user [puppet] - 10https://gerrit.wikimedia.org/r/1036579 (owner: 10Ladsgroup) [09:26:38] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9836904 (10Soda) [09:28:21] (03PS2) 10Stevemunene: admin_ng: create datahub-next namespace tlsHostnames and extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036557 (https://phabricator.wikimedia.org/T361185) [09:32:10] (03CR) 10Brouberol: [C:03+1] "praise: Helm lint CI diff shows that this is doing exactly what we want" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036557 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [09:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T364299)', diff saved to https://phabricator.wikimedia.org/P63413 and previous config saved to /var/cache/conftool/dbconfig/20240528-093344-marostegui.json [09:33:50] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc2049.codfw.wmnet [09:33:51] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:34:17] (03CR) 10Slyngshede: [C:03+1] "LGTM, I was one of the people who assumed this belonged to clinic duty." 
[puppet] - 10https://gerrit.wikimedia.org/r/1036574 (owner: 10Muehlenhoff) [09:34:27] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc2049.codfw.wmnet [09:35:36] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bookworm [09:35:46] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1036573|Stop writing to af_user(_text)/afh_user(_text) everywhere (T337920)]], [[gerrit:1036586|Update interwiki cache]] (duration: 17m 49s) [09:35:53] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [09:35:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1228 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63414 and previous config saved to /var/cache/conftool/dbconfig/20240528-093552-arnaudb.json [09:35:59] (03CR) 10Muehlenhoff: [C:03+2] Send account check mails to SRE IF alias [puppet] - 10https://gerrit.wikimedia.org/r/1036574 (owner: 10Muehlenhoff) [09:37:18] (03CR) 10Stevemunene: [C:03+2] admin_ng: create datahub-next namespace tlsHostnames and extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036557 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [09:37:38] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:38:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2212.codfw.wmnet [09:39:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1243.eqiad.wmnet with reason: unknown lag [09:39:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1243.eqiad.wmnet with reason: unknown lag [09:39:42] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:19] (03Merged) 10jenkins-bot: admin_ng: create datahub-next namespace tlsHostnames and extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036557 (https://phabricator.wikimedia.org/T361185) (owner: 10Stevemunene) [09:40:20] (03PS1) 10Muehlenhoff: Switch db2212 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036583 (https://phabricator.wikimedia.org/T349619) [09:43:10] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:43:12] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:44:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db2212 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036583 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:45:19] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:45:35] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:46:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1219.eqiad.wmnet with OS bookworm [09:48:08] (03CR) 10Muehlenhoff: [C:03+2] openstack::base Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:49:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2212.codfw.wmnet [09:49:36] (03CR) 10Brouberol: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1035268 (https://phabricator.wikimedia.org/T365668) (owner: 10Stevemunene) [09:50:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1228 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63415 and previous config saved to /var/cache/conftool/dbconfig/20240528-095058-arnaudb.json [09:54:33] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [09:57:40] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1000) [10:02:25] !log installing jinja2 security updates [10:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:36] (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.9 [puppet] - 10https://gerrit.wikimedia.org/r/1036609 (https://phabricator.wikimedia.org/T354656) [10:04:57] (03CR) 10Santiago Faci: [C:03+2] device-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [10:05:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2216.codfw.wmnet [10:05:30] (03CR) 10Hnowlan: [C:03+1] shellbox: add PHP + Apache timeout settings 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [10:06:05] (03Merged) 10jenkins-bot: device-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036260 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [10:07:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63416 and previous config saved to /var/cache/conftool/dbconfig/20240528-100752-arnaudb.json [10:08:22] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [10:08:57] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:27] (03PS1) 10Muehlenhoff: Switch db2216 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036611 (https://phabricator.wikimedia.org/T349619) [10:09:39] (03CR) 10Jelto: [C:03+2] gitlab: bump exporter version to v1.0.9 [puppet] - 10https://gerrit.wikimedia.org/r/1036609 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [10:11:50] (03CR) 10Muehlenhoff: [C:03+2] Switch db2216 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036611 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:18:28] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:18:41] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9837047 (10MatthewVernon) @cmooney can you give me a time estimate for when you're going to be doing these, please? I'd like to put notes in my calendar. 
[10:21:29] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2049.codfw.wmnet with OS bookworm [10:21:49] 06SRE, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9837068 (10MatthewVernon) From a swift POV, should just be "check cluster is happy afterwards" [10:22:01] (03CR) 10Urbanecm: "wondering whether eswiki is the best choice, as they're a pilot (and have personalized praise in prod). Can we pick a different wiki for t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038) (owner: 10Sergio Gimeno) [10:22:28] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9837074 (10ABran-WMF) [10:22:38] (03PS3) 10Muehlenhoff: sslcert: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) [10:22:38] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9837077 (10ABran-WMF) [10:22:48] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9837072 (10MatthewVernon) [10:22:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63417 and previous config saved to /var/cache/conftool/dbconfig/20240528-102259-arnaudb.json [10:23:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2216.codfw.wmnet [10:23:04] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F 
to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9837079 (10ABran-WMF) [10:23:14] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9837080 (10ABran-WMF) [10:23:24] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9837081 (10MatthewVernon) Again, from a swift POV, this should just be a case of checking the cluster is happy afterwards. [10:23:36] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9837083 (10ABran-WMF) [10:23:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9837085 (10ABran-WMF) [10:23:56] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9837084 (10MatthewVernon) [10:24:12] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9837086 (10ABran-WMF) [10:24:30] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9837089 (10ABran-WMF) [10:24:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9837091 (10ABran-WMF) [10:24:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches 
Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9837092 (10ABran-WMF) [10:25:05] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9837093 (10ABran-WMF) [10:25:17] (03PS1) 10Sergio Gimeno: CommunityConfiguration: set feedback url instead of bug tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) [10:25:23] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9837099 (10ABran-WMF) [10:25:48] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9837103 (10MatthewVernon) [swift-wise, just need to check cluster OK afterwards] [10:26:12] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9837104 (10MatthewVernon) [10:26:27] (03PS1) 10Joal: Add wikidata history-dumps import to hdfs job [puppet] - 10https://gerrit.wikimedia.org/r/1036614 (https://phabricator.wikimedia.org/T364045) [10:26:51] (03CR) 10Muehlenhoff: [C:03+2] sslcert: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:27:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9837106 (10MatthewVernon) [10:27:17] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9837111 
(10MatthewVernon) [swift-wise, cluster health just needs checking afterwards] [10:27:25] (03PS1) 10Effie Mouzeli: Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) [10:27:45] (03CR) 10CI reject: [V:04-1] Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:28:30] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9837115 (10MatthewVernon) [10:29:32] (03PS2) 10Effie Mouzeli: Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) [10:29:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9837117 (10MatthewVernon) [from a swift POV, just need to check cluster OK afterwards] [10:29:54] (03PS2) 10Aklapper: Export source strings and documentation again [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [10:30:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9837121 (10MatthewVernon) [10:31:12] (03PS3) 10Muehlenhoff: tlsproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) [10:31:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9837137 (10MatthewVernon) ms-fe1014 will need depooling before this work is done (and then repooling afterwards). 
There's a sta... [10:32:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:33:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:33:11] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9837166 (10MatthewVernon) [10:33:14] (03CR) 10Hnowlan: [C:03+1] Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:33:15] (03CR) 10Marostegui: [C:03+2] Revert "db1243: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1036376 (owner: 10Marostegui) [10:33:21] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9837167 (10MatthewVernon) [swift-wise, just need to check cluster OK afterwards] [10:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63418 and previous config saved to /var/cache/conftool/dbconfig/20240528-103428-root.json [10:34:35] (03CR) 10Hnowlan: Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:34:39] (03CR) 10Aklapper: "Urgh, nope, I need another rebase" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [10:35:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9837189 (10MatthewVernon) [10:35:22] (03PS1) 10Muehlenhoff: 
ml/etcd: remove obsolete certificites [puppet] - 10https://gerrit.wikimedia.org/r/1036619 [10:36:01] (03PS3) 10Aklapper: Export source strings and documentation again [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [10:36:28] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9837210 (10MatthewVernon) swift-wise, just need to check the cluster's happy afterwards. moss-be1003 is part of the apus Ceph cl... [10:37:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9837218 (10MatthewVernon) [10:38:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63419 and previous config saved to /var/cache/conftool/dbconfig/20240528-103805-arnaudb.json [10:40:19] (03CR) 10Aklapper: "I think I now confused myself - after a rebase (because perviously there was a merge conflict) this leaves three files?" 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [10:40:57] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9837237 (10MatthewVernon) [10:41:00] (03PS3) 10Effie Mouzeli: Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) [10:42:00] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9837239 (10MatthewVernon) I'm on annual leave this day, so someone else will have to handle the ms frontend, which needs depooli... [10:42:32] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9837241 (10MatthewVernon) [10:44:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9837254 (10MatthewVernon) swift-wise, this should just be a case of checking the cluster is healthy afterwards. I'm on annual le... 
[10:44:49] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9837257 (10MatthewVernon) [10:45:15] (03PS1) 10Effie Mouzeli: Add wikikube-ctrl1001 to server SRV record for etcd 1 [dns] - 10https://gerrit.wikimedia.org/r/1036621 (https://phabricator.wikimedia.org/T353464) [10:45:18] (03PS1) 10Effie Mouzeli: Add wikikube-ctrl1002 to server SRV record for etcd 3 [dns] - 10https://gerrit.wikimedia.org/r/1036622 (https://phabricator.wikimedia.org/T353464) [10:45:20] (03PS1) 10Effie Mouzeli: Add wikikube-ctrl1003 to server SRV record for etcd 4 [dns] - 10https://gerrit.wikimedia.org/r/1036623 (https://phabricator.wikimedia.org/T353464) [10:45:44] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9837261 (10MatthewVernon) From the swift POV, this is just checking the cluster is happy afterwards. I'm on annual leave when th... 
[10:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63420 and previous config saved to /var/cache/conftool/dbconfig/20240528-104934-root.json [10:53:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63421 and previous config saved to /var/cache/conftool/dbconfig/20240528-105311-arnaudb.json [10:54:48] (03CR) 10Hnowlan: [C:03+1] "+1 provisional on puppet being stopped on wikikube-ctrl100[23]" [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:58:45] (03CR) 10Btullis: [C:03+2] Remove remaining references to snapshot1008 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [10:59:17] (03CR) 10Hnowlan: [C:03+1] Add wikikube-ctrl1001 to server SRV record for etcd 1 [dns] - 10https://gerrit.wikimedia.org/r/1036621 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:59:25] jouncebot: now [10:59:25] For the next 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1000) [10:59:32] jouncebot: next [10:59:32] In 1 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1200) [11:00:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1163.eqiad.wmnet [11:01:22] (03PS1) 10Muehlenhoff: Switch db1163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036625 (https://phabricator.wikimedia.org/T349619) [11:02:36] (03CR) 10Muehlenhoff: [C:03+2] Switch db1163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036625 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:03:13] (03PS1) 10Btullis: Configure snapshot1017 to 
be the misc cron snapshot runner [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) [11:04:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63422 and previous config saved to /var/cache/conftool/dbconfig/20240528-110440-root.json [11:05:04] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2657/co" [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [11:08:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63423 and previous config saved to /var/cache/conftool/dbconfig/20240528-110817-arnaudb.json [11:09:13] (03CR) 10Effie Mouzeli: [C:03+2] Add wikikube-ctrl1001 to server SRV record for etcd 1 [dns] - 10https://gerrit.wikimedia.org/r/1036621 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [11:13:57] (03CR) 10Effie Mouzeli: [C:03+2] Add wikikube-ctrl100[1-3] as master_stacked 2 [puppet] - 10https://gerrit.wikimedia.org/r/1036615 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [11:19:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P63424 and previous config saved to /var/cache/conftool/dbconfig/20240528-111946-root.json [11:24:39] (03PS1) 10Effie Mouzeli: sites.pp: fix wikikube-master1001 role definition [puppet] - 10https://gerrit.wikimedia.org/r/1036629 [11:25:28] (03CR) 10Stevemunene: [C:03+2] trafficserver: add datahub-next redirects [puppet] - 10https://gerrit.wikimedia.org/r/1035268 (https://phabricator.wikimedia.org/T365668) (owner: 10Stevemunene) [11:25:28] (03CR) 10Hnowlan: [C:03+1] sites.pp: fix wikikube-master1001 role 
definition [puppet] - 10https://gerrit.wikimedia.org/r/1036629 (owner: 10Effie Mouzeli) [11:25:36] (03CR) 10Effie Mouzeli: [C:03+2] sites.pp: fix wikikube-master1001 role definition [puppet] - 10https://gerrit.wikimedia.org/r/1036629 (owner: 10Effie Mouzeli) [11:25:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [11:26:48] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1163.eqiad.wmnet [11:32:15] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [11:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:47] (03PS1) 10Ladsgroup: Remove kormat's home directory files [puppet] - 10https://gerrit.wikimedia.org/r/1036631 [11:33:54] (03PS2) 10Ladsgroup: Remove kormat's home directory files [puppet] - 10https://gerrit.wikimedia.org/r/1036631 [11:34:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Remove kormat's home directory files [puppet] - 10https://gerrit.wikimedia.org/r/1036631 (owner: 10Ladsgroup) [11:34:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P63425 and previous config saved to /var/cache/conftool/dbconfig/20240528-113451-root.json [11:38:59] (03PS1) 10Marostegui: db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 [11:39:21] (03CR) 10CI reject: [V:04-1] db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 (owner: 10Marostegui) [11:39:27] FIRING: [3x] SystemdUnitFailed: 
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:41] (03PS2) 10Marostegui: db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 [11:40:11] (03PS1) 10Tim Starling: Create electionadmin group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036633 (https://phabricator.wikimedia.org/T209892) [11:42:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:43:11] (03CR) 10Dom Walden: [C:03+1] "I only have rights to +1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036633 (https://phabricator.wikimedia.org/T209892) (owner: 10Tim Starling) [11:44:05] (03PS3) 10Marostegui: db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 [11:44:27] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:49:27] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling 
@ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P63426 and previous config saved to /var/cache/conftool/dbconfig/20240528-114957-root.json [11:51:38] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2001.codfw.wmnet [11:56:22] (03CR) 10Tim Starling: [C:03+2] "In this repo you conventionally self-merge after a change is reviewed, because the change always needs to be immediately deployed by the o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036633 (https://phabricator.wikimedia.org/T209892) (owner: 10Tim Starling) [11:57:12] (03Merged) 10jenkins-bot: Create electionadmin group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036633 (https://phabricator.wikimedia.org/T209892) (owner: 10Tim Starling) [11:57:42] (03CR) 10Muehlenhoff: [C:03+2] tlsproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:58:51] (03CR) 10Hnowlan: [C:03+2] Rename kubernetes2032 to wikikube-worker2002 [puppet] - 10https://gerrit.wikimedia.org/r/1034977 (https://phabricator.wikimedia.org/T365571) (owner: 10JMeybohm) [11:59:07] (03Abandoned) 10Muehlenhoff: conftool: Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959215 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [11:59:27] FIRING: [9x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1200) [12:02:21] (03PS3) 10Ayounsi: Extend STORAGE_BACKEND config to support swift [software/netbox] - 10https://gerrit.wikimedia.org/r/980908 (https://phabricator.wikimedia.org/T310717) [12:05:04] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P63428 and previous config saved to /var/cache/conftool/dbconfig/20240528-120503-root.json [12:07:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1169.eqiad.wmnet [12:07:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2032 to wikikube-worker2002 [12:07:40] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:08:04] (03PS1) 10Muehlenhoff: Switch db1169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036636 (https://phabricator.wikimedia.org/T349619) [12:09:18] (03CR) 10Muehlenhoff: [C:03+2] Switch db1169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036636 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:09:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [12:09:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [12:09:42] !log installing glib2.0 security updates [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:10:01] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2032 to wikikube-worker2002 - hnowlan@cumin1002" [12:10:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:10:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:10:30] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:10:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T364299)', diff saved to https://phabricator.wikimedia.org/P63429 and previous config saved to /var/cache/conftool/dbconfig/20240528-121037-marostegui.json [12:10:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:11:27] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2032 to wikikube-worker2002 - hnowlan@cumin1002" [12:11:27] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:27] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2002 [12:12:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2002 [12:12:50] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2032 to wikikube-worker2002 [12:13:21] !log jiji@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl1001.eqiad.wmnet [12:14:27] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:54] (03PS1) 10Marostegui: check_private_data_report: Remove kormat [puppet] - 10https://gerrit.wikimedia.org/r/1036638 [12:15:40] (03CR) 10Hnowlan: [C:03+1] Add wikikube-ctrl1002 to server SRV record for etcd 3 [dns] - 10https://gerrit.wikimedia.org/r/1036622 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [12:15:41] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Remove kormat [puppet] - 
10https://gerrit.wikimedia.org/r/1036638 (owner: 10Marostegui) [12:15:56] (03CR) 10Effie Mouzeli: [C:03+2] Add wikikube-ctrl1002 to server SRV record for etcd 3 [dns] - 10https://gerrit.wikimedia.org/r/1036622 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [12:18:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2002.codfw.wmnet with OS bullseye [12:18:58] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2002.codfw.wmnet with OS bullseye [12:19:27] FIRING: [9x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:44] (03CR) 10DCausse: [C:03+1] profile::elasticsearch::cirrus: Remove obsolete http2 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1036556 (owner: 10Muehlenhoff) [12:20:54] FIRING: [14x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:10] ^ trying to deploy a single-file change but k8s is very slow [12:24:12] just failed with [12:24:16] 12:23:11 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. 
[12:24:27] FIRING: [11x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1218 T364290', diff saved to https://phabricator.wikimedia.org/P63431 and previous config saved to /var/cache/conftool/dbconfig/20240528-122442-arnaudb.json [12:24:45] Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition [12:24:48] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [12:24:54] (03PS1) 10Arturo Borrero Gonzalez: toolforge: drop toolforge-tfb-psp [puppet] - 10https://gerrit.wikimedia.org/r/1036640 (https://phabricator.wikimedia.org/T279110) [12:24:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1218.eqiad.wmnet with reason: reimage [12:25:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1218.eqiad.wmnet with reason: reimage [12:25:16] oh great now it's rolling back [12:25:29] we're doing maintenance on the masters at the moment [12:25:40] could be related but unclear [12:25:45] (03PS1) 10Ayounsi: move-vlan: remove unused variable definition [cookbooks] - 10https://gerrit.wikimedia.org/r/1036642 (https://phabricator.wikimedia.org/T350152) [12:25:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1218.eqiad.wmnet with OS bookworm [12:25:54] FIRING: [29x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:26:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm 
might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:26:58] (03CR) 10Hnowlan: [C:03+1] move-vlan: remove unused variable definition [cookbooks] - 10https://gerrit.wikimedia.org/r/1036642 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [12:27:22] what should I do about this? revert in git or retry or leave it for now? [12:27:42] !log installing jetty9 security updates [12:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:31] TimStarling: I'd retry [12:28:55] there was a spike in various API latencies but things are recovering [12:29:51] ok [12:29:53] (03CR) 10Ayounsi: [C:03+2] move-vlan: remove unused variable definition [cookbooks] - 10https://gerrit.wikimedia.org/r/1036642 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [12:30:30] !log tstarling@deploy1002 Synchronized wmf-config/core-Permissions.php: create electionadmin group on testwiki T209892 (duration: 31m 52s) [12:30:36] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [12:30:54] RESOLVED: [29x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:45] if it's still going in half an hour, I will be looking to hand over ownership, it's getting late here [12:32:42] (03PS3) 10Muehlenhoff: nginx: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013) [12:33:37] (03Merged) 10jenkins-bot: move-vlan: remove unused variable definition [cookbooks] - 10https://gerrit.wikimedia.org/r/1036642 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [12:34:27] FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:47] FIRING: HelmReleaseBadStatus: Helm release mw-api-int/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:37:05] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet [12:37:21] (03CR) 10Elukey: [C:03+2] Move thanos-fe1002's envoy to CFSSL/PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036284 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:37:26] 12:35:09 Started sync-prod-k8s [12:37:26] The connection to the server kubemaster.svc.eqiad.wmnet:6443 was refused - did you specify the right host or port? [12:38:24] !log move thanos-fe1002's envoy TLS cert to CFSSL/PKI - T344324 [12:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:30] T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [12:38:33] (03CR) 10Jforrester: [C:03+1] "Ack." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [12:38:39] (03CR) 10Jforrester: [C:03+1] "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [12:38:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [12:39:27] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:43] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe1002.eqiad.wmnet [12:40:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-api-int/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:41:16] I pressed ctrl-C, it was just a series of connection refused errors [12:41:23] it's started to roll back, not sure if that will succeed [12:41:31] who is running the backport window? can someone volunteer to complete deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1036633 once k8s is fixed? 
[12:41:49] yeah, rollback is failing [12:42:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [12:44:27] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:57] I only have half an hour at the beginning of the backport window [12:47:16] !log tstarling@deploy1002 Synchronized wmf-config/core-Permissions.php: create electionadmin group on testwiki T209892 (attempt 2 after k8s-related rollback) (duration: 16m 02s) [12:47:21] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [12:47:29] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [12:49:02] Lucas_WMDE: we will have to wait for the kubernetes problem to be sorted [12:49:12] FIRING: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:27] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:43] (03PS1) 10Elukey: role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) [12:49:49] !incidents [12:49:50] 4703 (ACKED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 
probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [12:49:58] I acked it kamila_ [12:51:04] thanks marostegui [12:51:11] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [12:51:51] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:52:09] effie and I are looking at this, unclear what's wrong. pybal is failing to connect to the masters [12:52:23] which in turn looks like the apiservers failing [12:52:28] !log installing python-urllib3 security updates [12:52:30] the kubernetes apiservers [12:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:57] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:12] RESOLVED: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:27] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:41] some recovery, but no good reason as to why [12:55:51] weird [12:56:40] maybe this worked okay in codfw because requests are a lot lower and we didn't have a deploy in the middle of it [12:56:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:56:54] RESOLVED: [8x] KubernetesAPILatency: High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:58:39] !log testing fifo-log-demux 0.7.5 on cp3081 and cp3073 [12:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] marostegui: will coordinate in -sre [12:59:27] FIRING: [13x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1300). [13:00:05] denisse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:16] I’m around but only for half an hour [13:00:24] so it would probably be better for someone else to deploy [13:00:34] (also it sounds like k8s issues aren’t fully resolved yet) [13:00:36] please don't deploy right now [13:00:40] ack [13:00:48] thanks Lucas_WMDE [13:01:03] Here. [13:02:26] I’ve also added TimStarling’s change to the backport window on-wiki, just for visibility [13:02:48] thanks [13:03:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1218.eqiad.wmnet with OS bookworm [13:03:27] (03PS1) 10Ssingh: P:dns:::auth: fix indentation for check_state.erb [puppet] - 10https://gerrit.wikimedia.org/r/1036644 [13:04:45] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2002.codfw.wmnet with OS bullseye [13:04:54] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2659/co" [puppet] - 10https://gerrit.wikimedia.org/r/1036644 (owner: 10Ssingh) [13:04:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host [13:06:51] !log installing man-db bugfix updates [13:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:42] Lucas_WMDE, denisse it looks like k8s issues are resolved, you can deploy now [13:07:49] alright… [13:07:58] denisse: is it okay if we do TimStarling’s change first? as it’s already merged [13:08:23] Lucas_WMDE: btw do I have time to add to the window my mediawiki config [13:08:27] @Lucas_WMDE: That's totally okay. 
[13:08:54] effie: you can add it, but I’ll be gone at half past [13:09:04] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] [13:09:06] so it’s not even certain I’ll be able to deploy denisse’s change, I think :S [13:09:10] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [13:09:11] !log sudo cumin 'A:dnsbox' 'disable-puppet "merging CR 1036644"' [13:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:27] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:29] but if you want to self-serve that would be fine by me [13:09:35] Lucas_WMDE: I don't mind, shall I +2 it? [13:09:50] not yet please [13:10:01] once I’m done deploying ^^ [13:10:06] Lucas_WMDE: ah [13:10:06] nothing stands out in k8s events in logstash (yet) [13:10:06] unless it’s a backport where CI takes longer? 
[13:10:25] 13:09:54 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-28-130912-publish (ran as mwdeploy@kubernetes2032.codfw.wmnet) returned [255]: ssh: Could not resolve hostname kubernetes2032.codfw.wmnet: Name or service not known [13:10:25] 13:10:11 docker_pull_k8s: 100% (in-flight: 0; ok: 371; fail: 1; left: 0) [13:10:25] 13:10:11 1 K8s nodes failed to pull the multiversion image [13:10:27] Lucas_WMDE: no idea, I was thinking about adding it to the backport you are doing now [13:10:28] ^ scap output [13:10:36] but you were mid-flight [13:10:55] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns:::auth: fix indentation for check_state.erb [puppet] - 10https://gerrit.wikimedia.org/r/1036644 (owner: 10Ssingh) [13:11:37] hnowlan: SAL says you renamed kubernetes2032 to wikikube-worker2002 earlier today, so I assume the scap output above is relevant to you? [13:11:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and tstarling: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:55] !log installing bzip2 bugfix updates [13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:13] https://test.wikipedia.org/wiki/Special:ListGroupRights shows electionadmins on k8s-mwdebug, I assume that’s enough confirmation [13:12:14] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and tstarling: Continuing with sync [13:12:30] Lucas_WMDE: yes it is, we wil sort this problem too [13:12:35] okay :) [13:12:56] effie: I’m not sure I understand you correctly… do you have a Gerrit change you wanted to deploy? 
[13:13:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364299)', diff saved to https://phabricator.wikimedia.org/P63432 and previous config saved to /var/cache/conftool/dbconfig/20240528-131325-marostegui.json [13:13:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:14:14] effie: yes, that is me unfortunately - can you continue? [13:15:15] I’m still continuing [13:15:19] (10% through k8s deployment atm) [13:16:09] ack [13:18:32] (03PS1) 10Muehlenhoff: Switch db1169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036667 (https://phabricator.wikimedia.org/T349619) [13:18:48] (03PS2) 10Muehlenhoff: Switch db1169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036667 (https://phabricator.wikimedia.org/T349619) [13:19:20] (oops, I think I meant “still deploying” not “still continuing” ^^) [13:19:27] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:27] (60% now) [13:20:55] !log sudo cumin -b1 -s120 'A:dnsbox and not P{dns6001*}' 'run-puppet-agent --enable "merging CR 1036644"' [13:20:57] I’m not sure if the k8s deployment is slower than usual or if I’m just imagining things… [13:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:12] FIRING: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:18] k8s events in logstash flow normally and don't show anything worrying yet [13:23:33] ah kubemaster1002... 
hmmm [13:23:40] sigh what now [13:24:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63433 and previous config saved to /var/cache/conftool/dbconfig/20240528-132407-arnaudb.json [13:24:27] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:46] well, I guess it's not quite resolved [13:25:30] I'll start an incident doc [13:25:31] hi, just getting online, how can I help? [13:26:01] Lucas_WMDE: how's the deployment going? [13:26:08] 72% [13:26:11] (03PS4) 10Bking: dse-k8s: add airflow-analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) [13:26:13] (still in k8s, not bare-metal) [13:27:37] wait, how is it going BACKWARDS [13:27:41] I just saw it tick from 71% to 70% [13:27:46] incident doc https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4/edit?usp=sharing [13:28:04] “ok” is going down and “left” going up o_O [13:28:10] (fail 0) [13:28:12] RESOLVED: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63434 and previous config saved to /var/cache/conftool/dbconfig/20240528-132833-marostegui.json [13:28:40] Lucas_WMDE: does that mean that it is progressing ? [13:28:48] it’s… regressing? 
[13:28:51] I have no idea [13:29:02] it’s now back down to 67% [13:29:11] it's rolling back? [13:29:14] ok 1344, left 660 [13:29:25] ok 1340 left 664 [13:29:40] (03CR) 10Brouberol: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:29:52] also, I’m sorry but I’m about to be in a meeting where I won’t be able to pay any attention to this :S [13:29:58] should I let it run or Ctrl+C? [13:30:07] (03CR) 10Brouberol: [C:03+1] Configure snapshot1017 to be the misc cron snapshot runner [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:30:08] (I’m not in a server-side tmux, I’m afraid, only a local one that you can’t attach to) [13:30:45] (03PS2) 10Stevemunene: trafficserver: add datahub redirects to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1035731 (https://phabricator.wikimedia.org/T365668) [13:30:48] (03PS2) 10Stevemunene: provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) [13:31:23] (03CR) 10Muehlenhoff: [C:03+2] Switch db1169 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036667 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:31:26] Lucas_WMDE: let is run, it is rolling back [13:31:29] ok [13:31:54] FIRING: [27x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:32:29] we are coordinating in -sre btw in order to avoid bots messing with our chats [13:34:12] FIRING: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1169.eqiad.wmnet [13:35:46] (03CR) 10Muehlenhoff: [C:03+2] nginx: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:35:48] I did a fuckup, actual incident doc https://docs.google.com/document/d/1vTSjrC7Pm4vRhv0tE6xjJGPo5SG-KPEytB2CpASzoVc/edit?usp=sharing [13:36:06] (03CR) 10Brouberol: [C:03+1] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [13:36:14] (03CR) 10Brouberol: [C:03+1] trafficserver: add datahub redirects to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1035731 (https://phabricator.wikimedia.org/T365668) (owner: 10Stevemunene) [13:36:54] FIRING: [40x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:38:02] (03CR) 10Krinkle: [multiversion] Add 'manage-dblist init-labs' subcommand (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036313 (owner: 10Gergő Tisza) [13:39:12] RESOLVED: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63435 and previous config saved to /var/cache/conftool/dbconfig/20240528-133913-arnaudb.json [13:41:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1207 T364290', diff saved to 
https://phabricator.wikimedia.org/P63436 and previous config saved to /var/cache/conftool/dbconfig/20240528-134150-arnaudb.json [13:41:54] FIRING: [39x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:58] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [13:42:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1207.eqiad.wmnet with reason: reimage [13:42:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1207.eqiad.wmnet with reason: reimage [13:42:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1207.eqiad.wmnet with OS bookworm [13:43:34] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] (duration: 34m 29s) [13:43:40] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [13:43:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63437 and previous config saved to /var/cache/conftool/dbconfig/20240528-134341-marostegui.json [13:43:44] scap finished, will post details when meeting over [13:44:14] thanks Lucas_WMDE effie is about to put in a scap lock while we sort k8s [13:44:20] ack [13:44:27] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:55] (03PS1) 10Muehlenhoff: nginx: Remove prometheus.lua [puppet] - 10https://gerrit.wikimedia.org/r/1036672 [13:45:47] Hi! I have a backport I scheduled for yesterday, but forgot that yesterday was a US holiday. 
is someone doing the UTC afternoon backport now? If so, could I ask for some help deploying before that window closes? [13:46:14] !log jiji@deploy1002 Locking from deployment [ALL REPOSITORIES]: Kubernetes masters trouble - no deployments - serviceops [13:46:27] (ah, backscroll didn't load, reading now...) [13:46:46] all deployments are blocked at the moment [13:46:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:46:54] RESOLVED: [38x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:47:33] Lucas_WMDE: got it thank you. [13:48:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036672 (owner: 10Muehlenhoff) [13:49:27] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:47] FIRING: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:53:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:54:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63438 and previous config saved to /var/cache/conftool/dbconfig/20240528-135419-arnaudb.json [13:55:09] !log installing 
pillow security updates [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:11] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) (owner: 10Sergio Gimeno) [13:58:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364299)', diff saved to https://phabricator.wikimedia.org/P63439 and previous config saved to /var/cache/conftool/dbconfig/20240528-135848-marostegui.json [13:58:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:58:54] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:59:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63440 and previous config saved to /var/cache/conftool/dbconfig/20240528-135912-marostegui.json [14:01:02] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=wikikube-ctrl1001.eqiad.wmnet [14:02:00] effie, akosiaris and others: I put the scap output in https://phabricator.wikimedia.org/P63441 now [14:02:46] Lucas_WMDE: can you give me access to that paste? [14:03:02] done [14:03:08] thanks! 
[14:03:28] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:03:31] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:03:54] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [14:04:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1374 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:06:47] FIRING: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:07:32] !log akosiaris@cumin1002 conftool action : set/weight=5; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet [14:08:38] !log akosiaris@cumin1002 conftool action : set/weight=1; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet [14:09:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63442 and previous config saved to /var/cache/conftool/dbconfig/20240528-140925-arnaudb.json [14:09:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [14:10:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [14:10:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:11:48] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:50] (03Abandoned) 10Pppery: Export source strings and documentation again [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [14:14:39] (03CR) 10Pppery: "Superseded by https://gerrit.wikimedia.org/r/c/phabricator/translations/+/1036224" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1036342 (owner: 10Pppery) [14:14:48] uncommitted DNS changes are: [14:14:49] +wikikube-worker2002 1H IN A 10.192.14.9 [14:14:53] +wikikube-worker2002 1H IN AAAA 2620:0:860:10f:10:192:14:9 [14:15:14] just as an FYI in case it is relevant to the current debugging [14:16:13] https://phabricator.wikimedia.org/P63443 [14:18:16] (03PS1) 10Effie Mouzeli: kubernetes stacked masters etcd: allow etcd clients [puppet] - 10https://gerrit.wikimedia.org/r/1036680 [14:19:35] (03CR) 10CDanis: [C:03+1] kubernetes stacked masters etcd: allow etcd clients [puppet] - 10https://gerrit.wikimedia.org/r/1036680 (owner: 10Effie Mouzeli) [14:20:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:20:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [14:21:00] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:21:30] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet [14:21:55] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/1036680/2660/wikikube-ctrl1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1036680 (owner: 10Effie Mouzeli) [14:22:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:22:57] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes stacked masters etcd: allow etcd clients [puppet] - 10https://gerrit.wikimedia.org/r/1036680 (owner: 10Effie Mouzeli) [14:23:12] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:24:27] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63444 and previous config saved to /var/cache/conftool/dbconfig/20240528-142431-arnaudb.json [14:25:49] !log enabling puppet on wikikube-ctrl100[1-2]* [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1184.eqiad.wmnet [14:27:40] !log akosiaris@deploy1002 
helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [14:28:12] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:28:30] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1001.eqiad.wmnet [14:28:37] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host db1207.eqiad.wmnet with OS bookworm [14:29:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1207.eqiad.wmnet with OS bookworm [14:29:59] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1001.eqiad.wmnet [14:31:24] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [14:31:50] (03PS1) 10Muehlenhoff: Switch db1184 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036683 (https://phabricator.wikimedia.org/T349619) [14:34:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1374 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:36:48] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:38:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:43:02] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:43:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage [14:43:12] FIRING: [2x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:02] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:44:33] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:46:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage [14:47:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:11] (03CR) 10Muehlenhoff: [C:03+2] Switch db1184 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036683 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:48:12] RESOLVED: [2x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1001:6443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:38] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2002 - hnowlan@cumin1002" [14:49:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2002 - hnowlan@cumin1002" [14:49:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:32] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2002.codfw.wmnet 223.16.192.10.in-addr.arpa 3.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2002.codfw.wmnet 223.16.192.10.in-addr.arpa 3.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:36] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2002 [14:50:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2002 [14:50:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:51:20] :) [14:51:41] (03Abandoned) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [14:52:33] sukhe: just finishing the run that paused earlier, apologies for the diff noise [14:53:08] (03Abandoned) 10Muehlenhoff: prometheus::mysqld_exporter: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/812239 (owner: 10Muehlenhoff) [14:53:17] hnowlan: no problem at 
all and thanks. the only reason we care about this is that if DNS changes are pending on netbox, they block all other updates till those are merged [14:53:21] thanks for taking care of it [14:53:38] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:53:52] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:27] (03CR) 10Dzahn: "ah, thanks, I should have seen this coming and added something like this when adding the automation class" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (https://phabricator.wikimedia.org/T365804) (owner: 10EoghanGaffney) [14:54:29] (03PS1) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) [14:54:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1184.eqiad.wmnet [14:55:21] (03PS5) 10CDobbins: purged: set use_pki to true for cp6001 in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) [14:55:40] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:56:48] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:49] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2661/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:59:27] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:04] eoghan, jelto, arnoldokoth, and mutante: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1500) [15:04:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:04:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:04:12] FIRING: [2x] ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:46] hi [15:04:51] rzl: hi we're discussing in #-sre [15:04:56] https://docs.google.com/document/d/1vTSjrC7Pm4vRhv0tE6xjJGPo5SG-KPEytB2CpASzoVc/edit [15:05:01] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [15:05:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:05:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host 
db1186.eqiad.wmnet [15:05:17] cdanis: thanks yeah, I saw there was context but I'm still getting caught up [15:05:19] !incidents [15:05:19] 4708 (ACKED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad) [15:05:20] 4707 (RESOLVED) [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad) [15:05:20] 4706 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:05:20] 4705 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:05:20] 4703 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:06:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:06:14] (03PS1) 10Muehlenhoff: Switch db1186 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036689 (https://phabricator.wikimedia.org/T349619) [15:06:44] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [15:07:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1207.eqiad.wmnet with OS bookworm [15:09:12] RESOLVED: [2x] ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:13] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2002.codfw.wmnet with reason: host reimage [15:09:27] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:34] (03CR) 10Vgutierrez: "this patch could use a VTC test" [puppet] - 10https://gerrit.wikimedia.org/r/1035011 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:11:00] (03CR) 10Muehlenhoff: [C:03+2] Switch db1186 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036689 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:12:04] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [15:12:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2002.codfw.wmnet with reason: host reimage [15:13:57] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [15:15:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1186.eqiad.wmnet [15:18:31] (03CR) 10Ladsgroup: [C:03+1] lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:18:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1206.eqiad.wmnet [15:18:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:18:57] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:51] (03PS1) 10Muehlenhoff: Switch db1206 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1036691 (https://phabricator.wikimedia.org/T349619) [15:21:15] (03CR) 10Muehlenhoff: [C:03+2] Switch db1206 to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/1036691 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:21:40] (03PS3) 10Marostegui: mariadb: Promote db1192 to master [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) [15:23:17] (03CR) 10Ladsgroup: [C:03+2] db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 (owner: 10Marostegui) [15:23:51] (03Abandoned) 10Muehlenhoff: Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/686583 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [15:24:24] (03PS2) 10Muehlenhoff: Deprecate system::role for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1032636 [15:24:33] (03PS6) 10Aaron Schulz: Set "templateOverridesBySection" in an etcd.php loop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 [15:24:44] (03CR) 10CI reject: [V:04-1] Set "templateOverridesBySection" in an etcd.php loop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 (owner: 10Aaron Schulz) [15:28:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1206.eqiad.wmnet [15:28:07] (03Merged) 10jenkins-bot: db_maint_mapper_sal.py: Add arnaudb [software] - 10https://gerrit.wikimedia.org/r/1036632 (owner: 10Marostegui) [15:28:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:29:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:29:27] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:38] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [15:30:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:31:28] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [15:32:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2002.codfw.wmnet with OS bullseye [15:33:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:34:03] (03CR) 10Jcrespo: "Looks fine, but could you run a puppet compiler run of backup1001, backup2001, backup1002, backup2003, backup2004, dbprov1001, dbprov2005," [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [15:35:11] !log sudo cumin 'A:dnsbox' 'disable-puppet "merging CR 1034476"' [15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:49] (03CR) 10Jcrespo: "Note also analytics backup is owned by DE, not DP (I don't think it is a problem, but you may want to add btullis to the patch)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [15:35:49] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [15:36:13] (03CR) 10Ssingh: [C:03+2] dns::auth::monitoring: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1034476 (owner: 10Muehlenhoff) [15:36:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63447 and previous config saved to /var/cache/conftool/dbconfig/20240528-153622-arnaudb.json [15:37:25] (03CR) 10Jcrespo: "Aside from those things, virtual +1 if the compiler only has trivial role changes/noop there." [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [15:38:14] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: phabricator deploy [15:38:27] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phabricator deploy [15:38:39] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: phabricator deploy [15:38:52] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phabricator deploy [15:39:18] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phabricator.wikimedia.org with reason: phabricator deploy [15:39:19] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: phabricator deploy [15:39:27] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:40:42] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: phabricator deploy [15:40:53] !log jiji@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Kubernetes masters trouble - no deployments - serviceops (duration: 114m 39s) [15:40:55] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: phabricator deploy [15:41:16] (03PS3) 10Muehlenhoff: Deprecate system::role for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1032636 [15:41:36] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phabricator.wikimedia.org with reason: phabricator deploy [15:41:37] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: phabricator deploy [15:41:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [15:42:14] !incidents [15:42:14] You're not allowed to perform this action. 
[15:42:28] !incidents [15:42:28] 4708 (RESOLVED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad) [15:42:28] 4707 (RESOLVED) [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad) [15:42:28] 4706 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:42:29] 4705 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:42:29] 4703 (RESOLVED) ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [15:42:58] brennen: we've just unlocked a global scap lock due to an incident we track in https://docs.google.com/document/d/1vTSjrC7Pm4vRhv0tE6xjJGPo5SG-KPEytB2CpASzoVc/edit [15:43:11] akosiaris: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1036633 ? [15:43:20] akosiaris: thanks - are we ok to go ahead with a phabricator deploy, or should we hold on that? [15:43:27] Amir1: that's the one, go ahead [15:43:38] brennen: phab is ok, go ahead [15:43:42] kk [15:43:45] should be brief.
[15:43:57] !log brennen@deploy1002 Started deploy [phabricator/deployment@e7093e2]: deploy phab2002 for T366075 [15:44:00] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] [15:44:02] T366075: Deploy Phabricator/Phorge 2024-05-28 - https://phabricator.wikimedia.org/T366075 [15:44:08] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [15:44:31] !log brennen@deploy1002 Finished deploy [phabricator/deployment@e7093e2]: deploy phab2002 for T366075 (duration: 00m 33s) [15:44:56] !log brennen@deploy1002 Started deploy [phabricator/deployment@e7093e2]: deploy phab1004 for T366075 [15:45:28] !log brennen@deploy1002 Finished deploy [phabricator/deployment@e7093e2]: deploy phab1004 for T366075 (duration: 00m 32s) [15:45:30] !log sudo cumin -b1 -s120 'A:dnsbox and not P{dns6001*}' 'run-puppet-agent --enable "merging CR 1034476"' [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:29] (03CR) 10Ssingh: "Looks good, let's run for cp4052 as well and we can wrap this up. It should be a NOOP there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [15:48:06] !log ladsgroup@deploy1002 tstarling and ladsgroup: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:48:20] !log ladsgroup@deploy1002 tstarling and ladsgroup: Continuing with sync [15:48:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:49:06] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:49:48] Amir1: let's hope this does not repeat the NDA bits from last time! fingers crossed :) [15:50:22] urbanecm: You take care of it [15:50:24] :D [15:50:29] I'm just a deployer [15:50:48] !log ran `sudo puppet node deactivate kubernetes2032.codfw.wmnet` to fix renamed host erroring in scap [15:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63448 and previous config saved to /var/cache/conftool/dbconfig/20240528-155129-arnaudb.json [15:51:47] RESOLVED: [4x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:53:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1206 T364290', diff saved to https://phabricator.wikimedia.org/P63449 and previous config saved to /var/cache/conftool/dbconfig/20240528-155309-arnaudb.json [15:53:15] T364290: Upgrade s1 to MariaDB 10.6 - 
https://phabricator.wikimedia.org/T364290 [15:53:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1206.eqiad.wmnet with reason: reimage [15:53:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1206.eqiad.wmnet with reason: reimage [15:53:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on kubernetes1037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:54:36] 10ops-codfw, 06DC-Ops: Relabel kubernetes2032 to wikikube-worker2002 - https://phabricator.wikimedia.org/T366085 (10hnowlan) 03NEW [15:55:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:55:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1206.eqiad.wmnet with OS bookworm [15:58:16] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2663/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035468 (https://phabricator.wikimedia.org/T364965) (owner: 10Lucas Werkmeister (WMDE)) [15:58:57] FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1600). [16:00:05] Lucas_WMDE, dduvall, and brett: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:39] o/ [16:00:41] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:00:45] (but idk if we’re in any shape to deploy anyways…) [16:00:50] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:01:23] Lucas_WMDE: hi! pcc looks good, I'm ICing this other thing but yours seems simple enough to do in parallel, hang on just a sec [16:01:40] ok! [16:01:48] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1036633|Create electionadmin group on testwiki (T209892)]] (duration: 17m 48s) [16:01:53] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [16:02:23] dduvall, brett: I didn't expect to see a traffic sre put an LVS change in the puppet window, brett are you intending to deploy that? [16:02:43] no is fine, just want to get on the same page :) [16:02:58] (03CR) 10Bking: [C:03+1] install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [16:03:07] brett is out today [16:03:16] oh! 
okay [16:03:24] I think the reason it was pushed to the window was Monday was a holiday in the US [16:03:32] I can take it today but I don't think it needs to be in this window fwiw [16:03:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63451 and previous config saved to /var/cache/conftool/dbconfig/20240528-160337-marostegui.json [16:03:45] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:03:53] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036695 (https://phabricator.wikimedia.org/T349774) [16:03:57] rzl: feel free to skip and I can sync up with dduvall later [16:03:59] sukhe: if you're willing to take it, yeah I'd appreciate that [16:04:01] thank you <3 [16:04:20] (03CR) 10Ladsgroup: [C:03+2] x-wikimedia-debug: add datacenter options for k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [16:04:24] np! dduvall please ping me when you are online :) [16:04:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [16:04:38] Lucas_WMDE: going ahead with yours, will you want to test anything manually? 
[16:05:01] (03Merged) 10jenkins-bot: x-wikimedia-debug: add datacenter options for k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [16:05:13] (03CR) 10RLazarus: [V:03+1 C:03+2] Remove statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/1035468 (https://phabricator.wikimedia.org/T364965) (owner: 10Lucas Werkmeister (WMDE)) [16:05:32] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1035361|x-wikimedia-debug: add datacenter options for k8s (T365478)]] [16:05:37] T365478: XWD: Allow choosing datacentre in k8s-mwdebug - https://phabricator.wikimedia.org/T365478 [16:05:42] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036695 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [16:06:00] (03CR) 10Muehlenhoff: "Sure, it's now inline" [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [16:06:01] rzl: don’t think so, no [16:06:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63452 and previous config saved to /var/cache/conftool/dbconfig/20240528-160635-arnaudb.json [16:06:48] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:52] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036695 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [16:06:52] IIUC I shouldn’t see a change anyway, since Puppet won’t remove the git clone either [16:06:53] Lucas_WMDE: cool -- also note removing those directories from Puppet doesn't delete them on-host, just makes them 
unmanaged -- do you also want them gone? [16:06:56] yep [16:07:11] yeah, I had that as a comment on the change, not sure if you saw [16:07:28] I probably don’t have permission to remove the dir manually, but let me see [16:07:43] ah sorry, moving too fast :) two stages would work but I can also just rm it [16:07:52] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:07:56] nope, looks like I should be able to rm it even [16:08:00] as it’s owned by analytics-wmde [16:08:05] and I seem to be able to sudo to that user [16:08:06] !log ladsgroup@deploy1002 ladsgroup and jiji: Backport for [[gerrit:1035361|x-wikimedia-debug: add datacenter options for k8s (T365478)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:08:09] !log ladsgroup@deploy1002 ladsgroup and jiji: Continuing with sync [16:08:25] rzl: should I just try that rm (in ~30 mins or so) and !log it? [16:08:26] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:08:42] Lucas_WMDE: sure, or I can run puppet manually to save you the wait, one sec :) [16:08:49] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:08:50] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:09:16] ok ^^ [16:09:31] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:09:32] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:09:51] actually, I guess that’s not needed? regardless of when the next puppet run is, it will run on a config that doesn’t recreate the clone [16:09:56] (03CR) 10Scott French: "Thanks, Hugh!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:10:01] it’s not like puppet constantly runs but only updates its config every 30 mins [16:10:04] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:10:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [16:10:11] yeah unless you get unlucky and it's running at the same time [16:11:18] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2002.codfw.wmnet [16:11:48] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:13] Lucas_WMDE: done, go ahead [16:12:21] !log kubectl node uncordon wikikube-worker2002.codfw.wmnet [16:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [16:13:15] thanks! 
[16:13:28] (for future reference -- either the two-commit version or the manual-delete version works fine -- using two commits is the Classically Correct Way Of Doing Things, and it's easier when there's more than one host, but cleaning up manually is fine too as long as you remember to do it) [16:14:49] !log lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde rm -rf /srv/analytics-wmde/wdcm/ # T364965; contained src/ as a clean git clone as of c2b0a324e9 / I024691a148, and nothing else [16:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:54] T364965: stat1007 to stat1011 migration pipeline output check - https://phabricator.wikimedia.org/T364965 [16:14:58] yeah, this is only on two hosts AFAIK [16:15:05] I’ll do it on stat1007 just in case (though AFAIK that’s going away soon) [16:16:55] or not https://phabricator.wikimedia.org/T364965#9838579 :) [16:17:03] anyway, I think I’m done then. thanks rzl! [16:17:13] thanks! [16:17:19] (puppet window is done) [16:17:32] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1035361|x-wikimedia-debug: add datacenter options for k8s (T365478)]] (duration: 12m 00s) [16:17:40] T365478: XWD: Allow choosing datacentre in k8s-mwdebug - https://phabricator.wikimedia.org/T365478 [16:18:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63453 and previous config saved to /var/cache/conftool/dbconfig/20240528-161845-marostegui.json [16:21:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63454 and previous config saved to /var/cache/conftool/dbconfig/20240528-162141-arnaudb.json [16:22:50] sukhe: hey sorry about that [16:22:56] kiddo drop off [16:23:13] dduvall: no problem at all of course :) [16:23:21] happy to get started when you are around [16:23:31] we can move to -traffic [16:23:53] sounds 
good [16:23:57] * dduvall goes there [16:26:45] rzl: thanks for handling the puppet patches, I was in a meeting [16:26:52] no worries! [16:28:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:45] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:31:55] (03CR) 10CI reject: [V:04-1] service: Remove blubberoid from backend servers and load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:32:15] (03PS2) 10Ssingh: service: Remove blubberoid from backend servers and load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:33:52] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [16:33:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63455 and previous config saved to /var/cache/conftool/dbconfig/20240528-163353-marostegui.json [16:33:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1206.eqiad.wmnet with OS bookworm [16:34:00] !log running run-puppet-agent on A:dnsbox [16:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:38] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2664/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:36:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 
100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63457 and previous config saved to /var/cache/conftool/dbconfig/20240528-163647-arnaudb.json [16:37:51] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [16:38:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1211', diff saved to https://phabricator.wikimedia.org/P63458 and previous config saved to /var/cache/conftool/dbconfig/20240528-163810-marostegui.json [16:38:51] (03PS1) 10Marostegui: db1211: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1036702 [16:39:06] (03CR) 10Klausman: [C:03+1] ml/etcd: remove obsolete certificites [puppet] - 10https://gerrit.wikimedia.org/r/1036619 (owner: 10Muehlenhoff) [16:39:20] (03CR) 10Marostegui: [C:03+2] db1211: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1036702 (owner: 10Marostegui) [16:39:56] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2665/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:40:00] (03PS1) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [16:40:20] (03CR) 10Ssingh: [V:03+2 C:03+2] "rebased, no code change. reviewed again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:41:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63459 and previous config saved to /var/cache/conftool/dbconfig/20240528-164106-arnaudb.json [16:41:12] !log cumin 'O:lvs::balancer' 'run-puppet-agent' [16:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] !log sudo cumin 'O:lvs::balancer' 'run-puppet-agent': T365742 [16:41:31] (03PS2) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [16:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:32] T365742: Remove blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742 [16:41:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1211.eqiad.wmnet with OS bookworm [16:41:51] (03CR) 10Alexandros Kosiaris: [C:03+1] install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman) [16:43:58] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [16:44:57] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2666/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:45:53] 06SRE, 10SRE-tools, 07SRE-Unowned, 06Infrastructure-Foundations: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9838782 (10Dzahn) I was only asking because I was clinic duty and trying to triage things tagged "Unowned". 
But it can stay this... [16:46:07] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:46:12] ^ expected [16:46:23] dduvall and I are removing blubberoid [16:46:47] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:47:13] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-secondary-codfw and A:lvs (T365742) [16:47:18] T365742: Remove blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742 [16:48:08] (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:48:57] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:49:01] ^ expected [16:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T364299)', diff saved to https://phabricator.wikimedia.org/P63461 and previous config saved to /var/cache/conftool/dbconfig/20240528-164902-marostegui.json [16:49:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:49:08] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:49:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:49:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [16:49:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [16:50:03] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T364069)', diff saved to https://phabricator.wikimedia.org/P63462 and previous config saved to /var/cache/conftool/dbconfig/20240528-165002-marostegui.json [16:50:07] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:51:37] (03CR) 10Xcollazo: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036626 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [16:52:16] (03PS3) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) [16:53:57] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage [16:56:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63463 and previous config saved to /var/cache/conftool/dbconfig/20240528-165612-arnaudb.json [16:57:33] !log sudo cumin 'A:lvs-low-traffic-eqiad or A:lvs-low-traffic-codfw' 'systemctl restart pybal.serice' [16:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:56] (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:57:57] (03PS1) 10Marostegui: Revert "db1211: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1036657 [16:58:15] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094 (10hnowlan) 03NEW [16:59:26] !log marostegui@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage [16:59:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-secondary-codfw and A:lvs (T365742) [16:59:34] T365742: Remove blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1700) [17:02:24] (03PS6) 10Jdlrobson: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: 10Mabualruz) [17:02:43] (03PS7) 10Jdlrobson: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: 10Mabualruz) [17:02:51] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [17:03:29] !log removing blubberoid's IP from ipvsadm: T365742 [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:46] (03PS8) 10Jdlrobson: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: 10Mabualruz) [17:08:05] !log sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.31:4666': T365742 [17:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:14] T365742: Remove 
blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742 [17:08:35] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:08:55] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:09:05] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet [17:09:05] (03PS1) 10CDanis: temporarily disable otelcol @ eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036707 (https://phabricator.wikimedia.org/T366094) [17:09:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet [17:09:13] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2014.codfw.wmnet [17:09:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2014.codfw.wmnet [17:09:49] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:10:01] cool [17:10:46] (03PS2) 10Ssingh: service: Remove blubberoid from service catalog and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1035797 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:11:12] (03CR) 10Ssingh: [V:03+2 C:03+2] "rebase, no code change. reviewed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035797 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:11:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63464 and previous config saved to /var/cache/conftool/dbconfig/20240528-171119-arnaudb.json [17:12:05] (03PS2) 10Ssingh: service: Remove remaining blubberoid related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1035798 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:12:54] (03CR) 10RLazarus: [C:03+1] temporarily disable otelcol @ eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036707 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [17:14:01] (03PS1) 10CDanis: otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) [17:14:05] dancy, andre: hi train deployers, oncall here :) fyi we were working an incident with the kubernetes control plane earlier and we believe it's under control; you may see a little extra slowness during scap deploys but we expect them to complete, if you get *errors* please ping me and/or cdanis [17:14:08] (03PS3) 10Bking: dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) [17:14:24] I have a meeting during the train window but I'll have one eye on it anyway if I can [17:14:41] (03PS2) 10CDanis: temporarily disable otelcol @ eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036707 (https://phabricator.wikimedia.org/T366094) [17:14:55] (03CR) 10CDanis: [C:03+2] "already deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036707 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [17:15:10] rzl: Got it. Thanks! 
[17:15:36] (03CR) 10Ssingh: [C:03+2] service: Remove remaining blubberoid related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1035798 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:15:51] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [17:16:47] (03PS2) 10Dzahn: lists/stewards: add timer to run mailman syncmembers for stewards-l [puppet] - 10https://gerrit.wikimedia.org/r/1034137 (https://phabricator.wikimedia.org/T351202) [17:17:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9838998 (10Papaul) [17:18:02] (03CR) 10Hnowlan: [C:03+1] otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [17:18:16] (03Merged) 10jenkins-bot: temporarily disable otelcol @ eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036707 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [17:18:57] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:11] (03CR) 10RLazarus: [C:03+1] "$ for i in 10.64.16.93 10.64.48.45 10.64.32.37; do host $i; done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis) [17:21:01] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host db1211.eqiad.wmnet with OS bookworm [17:21:29] !log removing blubberoid from staging, `helmfile -e staging destroy` (T365742) [17:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:34] T365742: Remove blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742 [17:22:51] (03PS1) 10Ladsgroup: Set zhwiki to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036710 (https://phabricator.wikimedia.org/T351237) [17:23:32] (03CR) 10Jcrespo: [C:03+1] "Good from data persistence team side. I may need to tune some descriptions, I noticed, but unrelated to this patch." [puppet] - 10https://gerrit.wikimedia.org/r/1032636 (owner: 10Muehlenhoff) [17:24:13] !log removing blubberoid from codfw, `helmfile -e codfw destroy` (T365742) [17:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:18] !log sudo -i puppet cert clean blubberoid.discovery.wmnet: T365742 [17:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:27] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1056 for ban highly-loaded node - bking@cumin2002 [17:25:05] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1056 for ban highly-loaded node - bking@cumin2002 [17:25:15] !log removing blubberoid from eqiad, `helmfile -e eqiad destroy` (T365742) [17:25:17] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1056.eqiad.wmnet for ban highly-loaded node - bking@cumin2002 [17:25:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:20] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1056.eqiad.wmnet for ban highly-loaded node - bking@cumin2002 [17:25:52] jouncebot: nowandnext [17:25:53] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1700) [17:25:53] In 0 hour(s) and 34 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1800) [17:26:05] (03CR) 10Marostegui: [C:03+2] Revert "db1211: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1036657 (owner: 10Marostegui) [17:26:16] (03CR) 10Ladsgroup: [C:03+2] Set zhwiki to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036710 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [17:26:23] (03PS1) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) [17:26:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63466 and previous config saved to /var/cache/conftool/dbconfig/20240528-172625-arnaudb.json [17:26:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036710 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [17:26:53] (03Merged) 10jenkins-bot: Set zhwiki to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036710 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [17:27:24] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1036710|Set zhwiki to read new for pagelinks migration (T351237)]] [17:27:31] T351237: Set beta and production to read 
new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [17:27:31] (03PS1) 10Santiago Faci: device-analytics: Downgrading to a previous version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036712 (https://phabricator.wikimedia.org/T360524) [17:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63467 and previous config saved to /var/cache/conftool/dbconfig/20240528-172942-root.json [17:30:08] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1036710|Set zhwiki to read new for pagelinks migration (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:30:25] (03PS2) 10Santiago Faci: device-analytics: Downgrading to a previous version for a staging test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036712 (https://phabricator.wikimedia.org/T360524) [17:30:38] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [17:30:53] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [17:31:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:11] (03CR) 10Dzahn: [C:03+2] "testing this, so far it's a dry-run by using the -n parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1034137 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:31:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:18] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2667/console" [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [17:35:08] (03CR) 10Santiago Faci: [C:03+2] device-analytics: Downgrading to a previous version for a staging test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036712 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [17:35:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:46] (03PS1) 10Dduvall: admin_ng: remove blubberoid namespace and helmfile.d files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) [17:36:08] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [17:36:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:36:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:36:57] (03Merged) 10jenkins-bot: device-analytics: Downgrading to a previous version for a staging test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036712 (https://phabricator.wikimedia.org/T360524) (owner: 10Santiago Faci) [17:36:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:39:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1036710|Set zhwiki to read new for pagelinks migration (T351237)]] (duration: 11m 48s) [17:39:18] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [17:40:12] (03PS2) 10Dduvall: admin_ng: remove blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) [17:41:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:41:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:41:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63468 and previous config saved to /var/cache/conftool/dbconfig/20240528-174131-arnaudb.json [17:42:45] (03CR) 10Dduvall: "@effie I wasn't sure if I should remove the chart definition in a follow up. Just let me know if so, and I will refactor." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:42:49] (03CR) 10Ssingh: [C:03+1] admin_ng: remove blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:43:37] (03CR) 10Ssingh: [C:03+1] "Looks good overall, at least per the instructions. Maybe someone more experienced can also help review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:44:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63469 and previous config saved to /var/cache/conftool/dbconfig/20240528-174448-root.json [17:46:50] (03PS1) 10Dzahn: lists::automation: don't try to write to logfile from command [puppet] - 10https://gerrit.wikimedia.org/r/1036719 (https://phabricator.wikimedia.org/T351202) [17:47:28] (03CR) 10Dzahn: [C:03+2] lists::automation: don't try to write to logfile from command [puppet] - 10https://gerrit.wikimedia.org/r/1036719 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [17:47:52] RESOLVED: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:48:01] (03PS1) 10Jdlrobson: [beta] Night mode options on desktop labs should match mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036720 [17:48:12] (03CR) 10CI reject: [V:04-1] [beta] Night mode options on desktop labs should match mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036720 (owner: 10Jdlrobson) [17:48:17] (03PS5) 10Jdlrobson: Drop unused config [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [17:49:15] (PS2) Jdlrobson: [beta] Night mode options on desktop labs should match mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1036720 [17:51:01] (PS2) Dzahn: lists::automation: don't try to write to logfile from command [puppet] - https://gerrit.wikimedia.org/r/1036719 (https://phabricator.wikimedia.org/T351202) [17:51:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:36] (CR) Jdrewniak: [C:+2] [beta] Night mode options on desktop labs should match mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1036720 (owner: Jdlrobson) [17:52:57] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9839193 (Papaul) [17:53:32] (CR) Dzahn: [V:+2 C:+2] lists::automation: don't try to write to logfile from command [puppet] - https://gerrit.wikimedia.org/r/1036719 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [17:53:42] (Merged) jenkins-bot: [beta] Night mode options on desktop labs should match mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1036720 (owner: Jdlrobson) [17:53:50] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9839201 (Papaul) [17:55:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.266 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63470 and previous config saved to /var/cache/conftool/dbconfig/20240528-175638-arnaudb.json [17:56:59] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [17:57:52] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:58:33] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [17:58:43] (PS1) Dzahn: lists::automation: fix heredoc syntax, remove double quotes [puppet] - https://gerrit.wikimedia.org/r/1036722 (https://phabricator.wikimedia.org/T351202) [17:58:52] (CR) AOkoth: [C:+2] Filter out addresses that cannot be removed from VRTS [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: LSobanski) [17:59:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63471 and previous config saved to /var/cache/conftool/dbconfig/20240528-175954-root.json [18:00:05] dancy and andre: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T1800). 
[18:01:01] o/ [18:01:36] (PS1) Ebernhardson: cirrus: Move remaining public writes to SUP [mediawiki-config] - https://gerrit.wikimedia.org/r/1036723 (https://phabricator.wikimedia.org/T363475) [18:02:14] o/ [18:03:05] (CR) Dzahn: [C:+2] lists::automation: fix heredoc syntax, remove double quotes [puppet] - https://gerrit.wikimedia.org/r/1036722 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [18:11:05] (PS1) Ahmon Dancy: Remove the php symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/1036725 (https://phabricator.wikimedia.org/T359643) [18:13:28] (CR) BryanDavis: [C:+1] Remove the php symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/1036725 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [18:14:36] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1036725 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [18:14:59] (PS1) Pppery: Add Phabricator antivandalism extension to Phabricator translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036726 (https://phabricator.wikimedia.org/T365858) [18:15:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P63472 and previous config saved to /var/cache/conftool/dbconfig/20240528-181503-root.json [18:15:22] (Merged) jenkins-bot: Remove the php symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/1036725 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [18:16:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [18:16:53] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1036725|Remove the php symlink (T359643)]] [18:16:58] T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643 
[18:17:29] (CR) Ssingh: purged: set use_pki to true for cp6001 in drmrs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [18:17:35] (CR) Ssingh: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [18:17:54] !log dancy@deploy1002 sync-world aborted: Backport for [[gerrit:1036725|Remove the php symlink (T359643)]] (duration: 01m 00s) [18:18:22] (CR) CDobbins: [V:+1] purged: set use_pki to true for cp6001 in drmrs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [18:18:36] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839289 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [18:18:50] (CR) CDobbins: [V:+1] purged: set use_pki to true for cp6001 in drmrs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [18:19:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:19:35] (PS1) Dzahn: lists::automation: double quote end text to enable interpolation [puppet] - https://gerrit.wikimedia.org/r/1036727 (https://phabricator.wikimedia.org/T351202) [18:19:40] !log dancy@deploy1002 Started scap: Backport for [[gerrit:1036725|Remove the php symlink (T359643)]] [18:19:46] (CR) CI reject: [V:-1] lists::automation: double quote end text to enable interpolation [puppet] - https://gerrit.wikimedia.org/r/1036727 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [18:20:11] !log dancy@deploy1002 sync-world aborted: Backport for 
[[gerrit:1036725|Remove the php symlink (T359643)]] (duration: 00m 30s) [18:20:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:20:40] (PS2) Dzahn: lists::automation: double quote end text to enable interpolation [puppet] - https://gerrit.wikimedia.org/r/1036727 (https://phabricator.wikimedia.org/T351202) [18:21:00] (CR) CDobbins: [V:+1 C:+2] purged: set use_pki to true for cp6001 in drmrs [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [18:21:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:22:52] (PS2) Pppery: Add Phabricator antivandalism extension to Phabricator translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1036726 (https://phabricator.wikimedia.org/T365858) [18:24:54] (CR) Dzahn: [C:+2] lists::automation: double quote end text to enable interpolation [puppet] - https://gerrit.wikimedia.org/r/1036727 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [18:30:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P63473 and previous config saved to /var/cache/conftool/dbconfig/20240528-183009-root.json [18:31:40] (PS1) Ahmon Dancy: Revert "Remove the php symlink" [mediawiki-config] - https://gerrit.wikimedia.org/r/1036658 (https://phabricator.wikimedia.org/T359643) [18:32:04] (CR) Ahmon Dancy: [C:+2] Revert "Remove the php symlink" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1036658 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [18:32:43] (Merged) jenkins-bot: Revert "Remove the php symlink" [mediawiki-config] - https://gerrit.wikimedia.org/r/1036658 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [18:33:44] (PS1) TrainBranchBot: group0 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1036729 (https://phabricator.wikimedia.org/T361401) [18:33:46] (CR) TrainBranchBot: [C:+2] group0 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1036729 (https://phabricator.wikimedia.org/T361401) (owner: TrainBranchBot) [18:34:25] (Merged) jenkins-bot: group0 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1036729 (https://phabricator.wikimedia.org/T361401) (owner: TrainBranchBot) [18:34:59] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839375 (Jclark-ctr) @akosiaris still failing for same issue for kafka-main1010 [18:35:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:36:20] ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102 (RobH) NEW p:Triage→Medium [18:36:29] ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9839398 (RobH) [18:36:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:37:34] ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9839422 (RobH) [18:37:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:38:05] (PS1) Dzahn: lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) [18:39:02] (PS2) Dzahn: lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) [18:39:50] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9839429 (Jclark-ctr) Open→Resolved Replaced Failed Drive [18:40:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:41:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:41:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63474 and previous config saved to /var/cache/conftool/dbconfig/20240528-184110-marostegui.json [18:41:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:43:38] rzl: `Finished sync-prod-k8s (duration: 03m 03s)` That's way faster than before! [18:43:47] \o/ [18:44:06] Huge win! [18:44:45] cdanis: ^ fascinating [18:44:54] fascinating [18:45:14] dancy: what was typical before? [18:45:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P63475 and previous config saved to /var/cache/conftool/dbconfig/20240528-184515-root.json [18:45:20] like.. 16 minutes? 
[18:45:51] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.7 refs T361401 [18:45:55] T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401 [18:46:04] sheesh ok [18:46:34] (CR) Bking: dse-k8s: add new airflow service to k8s cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: Bking) [18:47:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [18:49:11] whou, congrats [18:49:12] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839469 (Dzahn) @akosiaris re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035769/1/modules/profile/data/profile/installserver/preseed.yaml I th... [18:50:18] rzl/cdanis/andre: Correction. ~16 minutes was typical for a full backport operation. Looking at recent logs, sync-prod-k8s was taking ~7 minutes [18:50:27] thanks [18:50:33] (CR) Dzahn: [C:+2] lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [18:50:39] (PS3) Dzahn: lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) [18:50:49] Still an excellent improvement. 
[18:50:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye [18:50:57] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839473 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye [18:51:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:51:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:53:01] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on elastic1056.eqiad.wmnet with reason: rebooting after abnormally high load [18:53:16] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic1056.eqiad.wmnet with reason: rebooting after abnormally high load [18:54:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:54:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:54:44] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1056.eqiad.wmnet [18:55:39] (PS1) Dzahn: installserver: add partman for kafka-main1010 [puppet] - https://gerrit.wikimedia.org/r/1036733 (https://phabricator.wikimedia.org/T363212) [18:57:46] (CR) Dzahn: "it's bash globbing where we don't have the round brackets. 
since this was reverted following up with https://gerrit.wikimedia.org/r/c/oper" [puppet] - https://gerrit.wikimedia.org/r/1035769 (https://phabricator.wikimedia.org/T363212) (owner: Alexandros Kosiaris) [18:59:25] (CR) Dzahn: "fixing for kafka-main1010: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036733" [puppet] - https://gerrit.wikimedia.org/r/1036375 (https://phabricator.wikimedia.org/T363212) (owner: Ayounsi) [18:59:52] (CR) Dzahn: [C:+2] installserver: add partman for kafka-main1010 [puppet] - https://gerrit.wikimedia.org/r/1036733 (https://phabricator.wikimedia.org/T363212) (owner: Dzahn) [19:00:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P63476 and previous config saved to /var/cache/conftool/dbconfig/20240528-190021-root.json [19:01:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1056.eqiad.wmnet [19:17:18] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1036737 (https://phabricator.wikimedia.org/T128546) [19:18:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:19:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:19:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [19:19:40] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839652 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet... 
[19:20:11] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:20:15] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:20:15] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:20:25] PROBLEM - grafana.wikimedia.org on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [19:20:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:20:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:22:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:22:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:23:05] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:23:05] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:24:20] !log ganeti1027:~$ sudo gnt-instance reboot grafana1002 [19:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:24:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:24:55] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:24:55] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:25:07] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:25:09] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 16 Jun 2024 04:13:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:25:09] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 16 Jun 2024 04:13:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:25:19] RECOVERY - grafana.wikimedia.org on grafana1002 is OK: HTTP OK: HTTP/1.1 200 OK - 134230 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [19:25:44] SRE, collaboration-services, Infrastructure-Foundations, Mail: Update SPF records as needed - https://phabricator.wikimedia.org/T366113 (jhathaway) NEW [19:25:59] SRE, collaboration-services, Infrastructure-Foundations, Mail: Update SPF records as needed - https://phabricator.wikimedia.org/T366113#9839708 (jhathaway) p:Triage→Medium [19:26:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:26:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:28:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:28:41] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[19:29:08] (CR) Dzahn: lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [19:30:03] !log disable swap on grafana1002 [19:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:09] (CR) Dzahn: Filter out addresses that cannot be removed from VRTS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: LSobanski) [19:31:56] (CR) VolkerE: Drop unused config [mediawiki-config] - https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: Jdlrobson) [19:32:48] (PS1) JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) [19:33:40] (CR) CI reject: [V:-1] spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [19:36:03] (PS2) JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) [19:36:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1009.eqiad.wmnet with OS bullseye [19:36:56] (CR) CI reject: [V:-1] spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [19:37:02] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9839744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye executed... 
[19:38:29] (PS3) JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) [19:40:02] (CR) Dzahn: [V:+2 C:+2] lists::automation: add missing spaces before line breaks [puppet] - https://gerrit.wikimedia.org/r/1036731 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [19:43:29] (CR) Volans: "Long overdue, sorry for the late review" [software/spicerack] - https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: Ayounsi) [19:45:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [19:45:47] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [19:49:27] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240528T2000). [20:00:05] Jdlrobson, ebernhardson, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:16] o/ [20:01:48] hi - i can deploy [20:02:35] (PS9) Jdlrobson: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: Mabualruz) [20:02:37] cjming: I can deploy mine after your done [20:02:49] jan_drewniak: great! 
i'll ping when i'm finished [20:03:50] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: Mabualruz) [20:04:27] (Merged) jenkins-bot: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: Mabualruz) [20:04:59] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1034480|deploy(Popups): Make use of conditional user defaults (T364347)]] [20:05:04] T364347: Popups: Make use of conditional user defaults - https://phabricator.wikimedia.org/T364347 [20:05:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main1010.eqiad.wmnet with OS bullseye [20:07:39] !log cjming@deploy1002 mabualruz and cjming: Backport for [[gerrit:1034480|deploy(Popups): Make use of conditional user defaults (T364347)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:49] Jdlrobson: shall i sync? [20:07:56] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [20:08:12] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [20:09:15] Jdlrobson: are you able to test? 
[20:10:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:11:20] cjming: on it [20:11:45] cjming: LGTM [20:11:53] great - syncing [20:11:55] !log cjming@deploy1002 mabualruz and cjming: Continuing with sync [20:13:51] cjming: \o [20:14:02] * ebernhardson didn't completely lose track of time today, almost :P [20:14:36] lol - hi ebernhardson -- your patch is queued up next - should i go ahead and sync when ready? [20:15:02] cjming: yup. It will log spam a little on initial deploy, but should quiet shortly after. [20:15:15] roger that [20:15:35] (PS2) Ebernhardson: cirrus: Move remaining public writes to SUP [mediawiki-config] - https://gerrit.wikimedia.org/r/1036723 (https://phabricator.wikimedia.org/T363475) [20:17:23] (PS1) Reedy: hieradata/mediawiki.yaml: Move foundation.wm.o to wm.o docroot folder [puppet] - https://gerrit.wikimedia.org/r/1036744 (https://phabricator.wikimedia.org/T366005) [20:19:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63478 and previous config saved to /var/cache/conftool/dbconfig/20240528-201945-marostegui.json [20:19:53] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:20:51] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1034480|deploy(Popups): Make use of conditional user defaults (T364347)]] (duration: 15m 52s) [20:20:59] T364347: Popups: Make use of conditional user defaults - https://phabricator.wikimedia.org/T364347 [20:21:13] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] 
- https://gerrit.wikimedia.org/r/1036723 (https://phabricator.wikimedia.org/T363475) (owner: Ebernhardson) [20:21:20] Jdlrobson: should be live! [20:21:51] (Merged) jenkins-bot: cirrus: Move remaining public writes to SUP [mediawiki-config] - https://gerrit.wikimedia.org/r/1036723 (https://phabricator.wikimedia.org/T363475) (owner: Ebernhardson) [20:22:20] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1036723|cirrus: Move remaining public writes to SUP (T363475)]] [20:22:25] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:25:26] !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:1036723|cirrus: Move remaining public writes to SUP (T363475)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:25:47] !log cjming@deploy1002 cjming and ebernhardson: Continuing with sync [20:26:24] thx cjming ! [20:28:37] yw! [20:34:32] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1036723|cirrus: Move remaining public writes to SUP (T363475)]] (duration: 12m 11s) [20:34:36] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:34:47] ebernhardson: your changes are live! 
[20:34:52] jan_drewniak: all yours [20:34:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63479 and previous config saved to /var/cache/conftool/dbconfig/20240528-203453-marostegui.json [20:35:22] cjming: awesome, looking it over [20:36:44] (CR) Jdrewniak: [C:+2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1036737 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [20:37:24] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1036737 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [20:42:03] (PS1) Ebernhardson: cirrus: Restrict saneitizer to private wikis [puppet] - https://gerrit.wikimedia.org/r/1036749 [20:45:11] PROBLEM - Check whether ferm is active by checking the default input chain on parse1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:45:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:49:37] SRE, serviceops, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9839951 (CDanis) Posting a short comment now, before I start drafting a much longer comment (and possibly don't finish before my toddler ends my day): I believe the situation to be stable for no... 
[20:50:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63481 and previous config saved to /var/cache/conftool/dbconfig/20240528-205001-marostegui.json [20:51:36] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1036737| Bumping portals to master (T128546)]] (duration: 11m 23s) [20:51:41] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:56:48] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:33] (PS1) Ahmon Dancy: Remove the php symlink (v2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) [21:01:58] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1036737| Bumping portals to master (T128546)]] (duration: 10m 21s) [21:02:00] (CR) BryanDavis: [C:+1] Remove the php symlink (v2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) (owner: Ahmon Dancy) [21:02:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364299)', diff saved to https://phabricator.wikimedia.org/P63482 and previous config saved to /var/cache/conftool/dbconfig/20240528-210510-marostegui.json [21:05:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [21:05:16] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:05:26] !log marostegui@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [21:05:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T364299)', diff saved to https://phabricator.wikimedia.org/P63483 and previous config saved to /var/cache/conftool/dbconfig/20240528-210533-marostegui.json [21:11:04] (PS1) Ebernhardson: cirrus: Remove update rate alert [puppet] - https://gerrit.wikimedia.org/r/1036752 [21:13:14] (CR) Bking: [C:+2] cirrus: Restrict saneitizer to private wikis [puppet] - https://gerrit.wikimedia.org/r/1036749 (owner: Ebernhardson) [21:13:18] (CR) Bking: [C:+2] cirrus: Remove update rate alert [puppet] - https://gerrit.wikimedia.org/r/1036752 (owner: Ebernhardson) [21:15:11] RECOVERY - Check whether ferm is active by checking the default input chain on parse1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:15:25] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:20:03] (PS1) Ebernhardson: cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 [21:20:54] (PS1) Andrew Bogott: wmcs-backup: call backy 'cleanup' after removing expired backups [puppet] - https://gerrit.wikimedia.org/r/1036755 (https://phabricator.wikimedia.org/T366071) [21:21:44] (CR) CI reject: [V:-1] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:21:48] (CR) Andrew Bogott: [C:+1] horizon: remove openstack client packages [puppet] - https://gerrit.wikimedia.org/r/1035412 (owner: David Caro) [21:22:17] (CR) Andrew Bogott: [C:+1] hieradata: disable agent forwarding in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/875899 
(https://phabricator.wikimedia.org/T198138) (owner: Majavah) [21:23:25] (CR) Ebernhardson: "recheck" [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:23:45] PROBLEM - MediaWiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:26:05] (CR) Bking: [C:+1] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:26:15] (PS2) Ebernhardson: cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 [21:27:10] (CR) Andrew Bogott: [C:+2] wmcs-backup: call backy 'cleanup' after removing expired backups [puppet] - https://gerrit.wikimedia.org/r/1036755 (https://phabricator.wikimedia.org/T366071) (owner: Andrew Bogott) [21:27:26] (CR) CI reject: [V:-1] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:30:26] (PS3) Ebernhardson: cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 [21:31:37] (CR) CI reject: [V:-1] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:32:49] (PS9) Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) [21:33:38] (CR) Brennen Bearnes: "This probably still needs some tweaking, but I could use a gut check on the approach." 
[puppet] - https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes) [21:34:44] (PS4) Ebernhardson: cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 [21:35:54] (CR) CI reject: [V:-1] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:40:49] (PS5) Ebernhardson: cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 [21:51:04] (CR) Bking: [C:+2] cirrus: Port update rate alert from puppet [alerts] - https://gerrit.wikimedia.org/r/1036754 (owner: Ebernhardson) [21:58:07] FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:01:19] (PS1) Dzahn: vrts: add missing comma to vrts_aliases.py [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) [22:03:07] (CR) Dwisehaupt: [C:+1] "Looks good from our side. Verified that 74.121.51.111 is still the correct address for the Acoustic mailings." 
[dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [22:09:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364299)', diff saved to https://phabricator.wikimedia.org/P63484 and previous config saved to /var/cache/conftool/dbconfig/20240528-220950-marostegui.json [22:10:02] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:11:08] (PS3) Cwhite: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: Dzahn) [22:11:38] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9840297 (Dzahn) @Urbanecm We now have another timer on l... [22:13:47] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9840295 (colewhite) [22:16:44] (CR) EoghanGaffney: spf recs update: phabricator, gitlab, wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [22:18:01] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9840315 (colewhite) a:MareikeHeuerWMDE→colewhite Added Data Engineering tag for provisioning... 
[22:18:17] (PS4) JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) [22:18:50] (PS1) Dzahn: contint: enable zuul-merger daemon on contint2002 [puppet] - https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) [22:19:27] (CR) Dzahn: "Antoine, this is re your comment that we are missing "running the secondary zuul-merger daemon (and its companion git-daemon)". Is this w" [puppet] - https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [22:19:46] (CR) JHathaway: spf recs update: phabricator, gitlab, wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [22:21:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9840330 (colewhite) [22:24:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9840331 (colewhite) [22:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63485 and previous config saved to /var/cache/conftool/dbconfig/20240528-222500-marostegui.json [22:25:47] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9840335 (colewhite) a:RickiJay-WMDE→colewhite Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. [22:28:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9840337 (colewhite) [22:29:21] (CR) EoghanGaffney: [C:+1] "+1 from me for phabricator/gitlab." 
[dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway) [22:29:51] (PS1) Cwhite: admin: add rickijay to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1036592 (https://phabricator.wikimedia.org/T365574) [22:30:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:30:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:31:46] (PS1) JHathaway: rsyslog: include slashes in program names [puppet] - https://gerrit.wikimedia.org/r/1036763 [22:32:24] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9840343 (colewhite) Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. [22:32:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9840345 (colewhite) a:colewhite [22:32:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:32:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:32:49] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036763 (owner: JHathaway) [22:34:48] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:34:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:35:58] (PS1) Jdlrobson: feature(Popups): Conditional User Defaults Implementation [extensions/Popups] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1036664 (https://phabricator.wikimedia.org/T364347) [22:37:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:37:33] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [22:37:52] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9840361 (colewhite) [22:39:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:39:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:39:43] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9840364 (colewhite) Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. [22:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63486 and previous config saved to /var/cache/conftool/dbconfig/20240528-224008-marostegui.json [22:40:35] (PS1) Cwhite: admin: add radimer to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) [22:41:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:41:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:43:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:43:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:44:12] (PS1) Cwhite: admin: add mvolz to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1036594 (https://phabricator.wikimedia.org/T366088) [22:44:56] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Mvolz - https://phabricator.wikimedia.org/T366088#9840374 (colewhite) [22:45:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:45:24] !log 
@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:45:41] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Mvolz - https://phabricator.wikimedia.org/T366088#9840373 (colewhite) Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. [22:48:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:48:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:50:04] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840399 (colewhite) [22:50:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:50:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:51:09] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840401 (colewhite) @KFrancis Hi! Sohom is going to need an NDA on file for access to Logstash. Thank you! 
[22:54:12] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [22:55:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364299)', diff saved to https://phabricator.wikimedia.org/P63487 and previous config saved to /var/cache/conftool/dbconfig/20240528-225516-marostegui.json [22:55:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [22:55:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:55:30] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9840412 (colewhite) [22:55:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [22:55:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T364299)', diff saved to https://phabricator.wikimedia.org/P63488 and previous config saved to /var/cache/conftool/dbconfig/20240528-225541-marostegui.json [22:55:48] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [22:59:16] PROBLEM - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is CRITICAL: 
CRITICAL: 40.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [22:59:58] (PS1) Cwhite: admin: add sperry-wmf to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) [23:05:48] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9840440 (colewhite) @sonjaperry You'll want to read and sign L3 when you get a chance. Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Aho... [23:12:16] RECOVERY - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [23:21:45] (CR) SBassett: hieradata/mediawiki.yaml: Move foundation.wm.o to wm.o docroot folder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1036744 (https://phabricator.wikimedia.org/T366005) (owner: Reedy) [23:21:53] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840492 (KFrancis) Hello, please have Sohom send me their Name, mailing address, and email address, to kfrancis@wikimedia.org and I'll put the agreement together. Thanks! 
[23:24:48] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [23:26:14] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [23:26:35] (PS1) Dzahn: devtools: rename hieradata host data, match new instance name [puppet] - https://gerrit.wikimedia.org/r/1036764 (https://phabricator.wikimedia.org/T363196) [23:28:57] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840505 (colewhite) a:Soda [23:29:39] (CR) Dzahn: [C:+2] "gerrit-prod-1001 has been deleted - but shouldn't have been" [puppet] - https://gerrit.wikimedia.org/r/1036764 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn) [23:35:10] (PS1) Dzahn: devtools: update IP for gerrit test instance [puppet] - https://gerrit.wikimedia.org/r/1036765 (https://phabricator.wikimedia.org/T363196) [23:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1036596 [23:38:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1036596 (owner: TrainBranchBot) [23:40:51] (PS1) Jdlrobson: Limit responsive tables to .wikitables [skins/Vector] (wmf/1.43.0-wmf.7) - 
https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) [23:41:00] (PS2) Jdlrobson: Limit responsive tables to .wikitables [skins/Vector] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) [23:41:19] (CR) Dzahn: [C:+2] devtools: update IP for gerrit test instance [puppet] - https://gerrit.wikimedia.org/r/1036765 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn) [23:49:42] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:57:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364299)', diff saved to https://phabricator.wikimedia.org/P63489 and previous config saved to /var/cache/conftool/dbconfig/20240528-235755-marostegui.json [23:58:02] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299