[00:03:04] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10101491 (10KFrancis) Hi all, the NDA has been edited and sent for signatures. I'll confirm when it's complete. [00:07:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1068175 (owner: 10TrainBranchBot) [00:09:52] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:10:14] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:10:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:10:28] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:10:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:12:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T370903)', diff saved to https://phabricator.wikimedia.org/P68110 and previous config saved to /var/cache/conftool/dbconfig/20240829-001215-ladsgroup.json [00:12:20] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:25:54] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:26:16] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:26:16] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:26:28] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:26:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:42:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68111 and previous config saved to /var/cache/conftool/dbconfig/20240829-004215-ladsgroup.json [00:42:20] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:57:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P68112 and previous config saved to /var/cache/conftool/dbconfig/20240829-005722-ladsgroup.json [01:07:24] (03PS1) 10Andrew Bogott: keystone::apache: include auth_openidc [puppet] - 10https://gerrit.wikimedia.org/r/1068260 (https://phabricator.wikimedia.org/T359590) [01:12:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P68113 and previous config saved to /var/cache/conftool/dbconfig/20240829-011229-ladsgroup.json [01:14:10] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[01:27:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68114 and previous config saved to /var/cache/conftool/dbconfig/20240829-012736-ladsgroup.json [01:27:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [01:27:41] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:27:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [01:27:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68115 and previous config saved to /var/cache/conftool/dbconfig/20240829-012759-ladsgroup.json [01:39:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:18] FIRING: KubernetesCalicoDown: mw2294.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2294.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:21] (03PS1) 10Ebrahim: Enable dark mode for all namespaces in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [02:55:30] (03PS2) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [02:57:34] (03PS3) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [03:00:36] (03PS4) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [03:01:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:40] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:06:17] (03PS5) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [03:07:04] (03PS6) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 [03:09:23] (03PS7) 10Ebrahim: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 
[03:28:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68116 and previous config saved to /var/cache/conftool/dbconfig/20240829-032803-ladsgroup.json [03:28:08] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:36:34] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10101634 (10phaultfinder) [03:43:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P68117 and previous config saved to /var/cache/conftool/dbconfig/20240829-034310-ladsgroup.json [03:52:54] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 98 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:57:54] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:58:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P68118 and previous config saved to /var/cache/conftool/dbconfig/20240829-035817-ladsgroup.json [04:03:20] (03CR) 10Jdlrobson: [C:03+1] Enable dark mode for Creator: namespace in Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068302 (owner: 10Ebrahim) [04:13:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68119 and previous config saved to 
/var/cache/conftool/dbconfig/20240829-041326-ladsgroup.json [04:13:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [04:13:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:13:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [04:13:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68120 and previous config saved to /var/cache/conftool/dbconfig/20240829-041348-ladsgroup.json [04:49:14] (03PS4) 10KartikMistry: Enable Section Translation in bdr, btm, and dtp Wikpedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) [04:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:54:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) (owner: 10KartikMistry) 
[04:57:46] Amir1: How can I have the Gerrit Maintenance Bot add automated commits to a certain repository? We would like to add a language to the list when a new Wikipedia is created, similar to adding a language in cxserver/config/languages.yaml [05:14:10] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T0600) [06:00:05] marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T0600). [06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:18] FIRING: KubernetesCalicoDown: mw2294.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2294.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:14:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68121 and previous config saved to /var/cache/conftool/dbconfig/20240829-061453-ladsgroup.json [06:15:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:17:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:29:57] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow [06:30:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P68122 and previous config saved to /var/cache/conftool/dbconfig/20240829-063000-ladsgroup.json [06:30:07] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow (duration: 00m 10s) [06:45:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P68123 and previous config saved to /var/cache/conftool/dbconfig/20240829-064508-ladsgroup.json [06:55:42] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided) [06:55:46] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided) (duration: 00m 
03s) [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68124 and previous config saved to /var/cache/conftool/dbconfig/20240829-070017-ladsgroup.json [07:00:28] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:05:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:14:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:25:21] (03PS1) 10Klausman: preseed: Add ml-lab machines and dse-k8s-worker1009 [puppet] - 10https://gerrit.wikimedia.org/r/1068656 (https://phabricator.wikimedia.org/T372432) [07:30:42] (03PS1) 10Klausman: manifests: move new ML GPU hosts in eqiad from insetup to worker role [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) [07:31:15] (03CR) 10JMeybohm: "If this works and is good enough 
as a probe I think we can switch the chart default to `test_events: {}` indeed. I just wanted to make sur" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [07:31:19] (03CR) 10JMeybohm: "Sure, but I wanted to verify first (with you and with the service) that this is a proper approach." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [07:36:53] (03PS3) 10Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) [07:37:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3772/co" [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [07:39:33] (03PS1) 10Marostegui: mariadb: Add db2230 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1068662 [07:39:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing [07:39:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing [07:40:00] (03PS4) 10Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) [07:40:18] (03CR) 10Marostegui: [C:03+2] mariadb: Add db2230 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1068662 (owner: 10Marostegui) [07:40:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3773/co" [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [07:43:09] (03CR) 
10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3774/co" [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [07:44:48] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:46:23] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet [07:46:36] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host snapshot1011.eqiad.wmnet [07:47:14] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet [07:53:55] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet [07:54:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:05] (03CR) 10Slyngshede: [V:03+1] P:idp Clean up CAS 6.6 and Tomcat 9 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T0800) [08:01:56] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:04:00] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[08:15:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:17:38] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:18:04] PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:19:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:19:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:47] (03PS1) 10Slyngshede: data.yaml Update email address. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068673 [08:22:57] (03CR) 10David Caro: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [08:25:12] (03CR) 10David Caro: [V:03+1] "Tested on cloudcontrol1005:" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [08:25:46] (03CR) 10David Caro: [V:03+1 C:03+2] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [08:26:10] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:20] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:28:07] so hmm MediaWiki train! :) [08:28:30] (03CR) 10Ayounsi: [C:03+1] P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [08:31:37] (03PS2) 10Slyngshede: data.yaml Update email address. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068673 [08:32:23] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068681 (https://phabricator.wikimedia.org/T366965) [08:32:25] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068681 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [08:33:08] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068681 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [08:34:24] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:10] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:10] (03PS1) 10Joely Rooke WMDE: Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068684 (https://phabricator.wikimedia.org/T66315) [08:40:19] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [08:40:27] (03PS1) 10Ayounsi: site.pp: prepare for idp-test2005 VM on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1068688 (https://phabricator.wikimedia.org/T372909) [08:41:22] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:41:30] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.20 refs T366965 [08:41:34] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [08:44:34] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:45:20] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1068688 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [08:46:10] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:13] (03CR) 10Ayounsi: [C:03+2] site.pp: prepare for idp-test2005 VM on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1068688 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [08:47:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068684 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [08:47:58] (03CR) 10Brouberol: 
[C:03+2] deployment_server: define postgresql-test read/write usernames [puppet] - 10https://gerrit.wikimedia.org/r/1067916 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [08:48:24] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:51:03] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp-test2005.wikimedia.org [08:51:04] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:51:10] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:10] FIRING: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:19] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2005.wikimedia.org - ayounsi@cumin1002" [08:58:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2005.wikimedia.org - ayounsi@cumin1002" [08:58:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:24] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache idp-test2005.wikimedia.org on 
all recursors [08:58:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test2005.wikimedia.org on all recursors [08:58:55] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2005.wikimedia.org - ayounsi@cumin1002" [08:58:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2005.wikimedia.org - ayounsi@cumin1002" [08:59:17] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS bookworm [08:59:38] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:00:41] (03CR) 10Cathal Mooney: [C:03+1] "Looks good!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 (owner: 10Ayounsi) [09:02:40] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:32] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:06:04] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow [09:06:10] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:15] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow (duration: 00m 11s) [09:06:32] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:51] 
(03PS3) 10Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) [09:08:42] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:10:34] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:01] (03PS4) 10Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) [09:11:02] (03CR) 10Ayounsi: [C:03+1] "🚀" [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:11:40] FIRING: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:46] (03CR) 10Cathal Mooney: [C:03+2] Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:12:22] (03Merged) 10jenkins-bot: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:13:05] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test2005.wikimedia.org with reason: host reimage [09:13:33] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2380.codfw.wmnet [09:14:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2380.codfw.wmnet [09:14:10] FIRING: SystemdUnitFailed: 
wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test2005.wikimedia.org with reason: host reimage [09:16:43] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:18:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:18:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:21:40] RESOLVED: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:22:37] RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:23:41] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:24:24] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "idp-test2005 - ayounsi@cumin1002" [09:24:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "idp-test2005 - ayounsi@cumin1002" [09:24:55] !log apply qos classifers and scedulers to interfaces on asw2-ulsfo T339850 [09:24:59] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:00] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [09:25:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:25:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:25:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:25:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:25:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T371742)', diff saved to https://phabricator.wikimedia.org/P68125 and previous config saved to /var/cache/conftool/dbconfig/20240829-092547-ladsgroup.json [09:25:52] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:27:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:28:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:28:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance 
[09:28:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:28:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T370903)', diff saved to https://phabricator.wikimedia.org/P68126 and previous config saved to /var/cache/conftool/dbconfig/20240829-092819-ladsgroup.json [09:28:23] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:30:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102107 (10Clement_Goubert) [09:32:36] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [09:32:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [09:35:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:59] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:43:14] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:44:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/admin 'apply'. [09:44:49] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:44:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:45:19] (03PS11) 10Clément Goubert: sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - 10https://gerrit.wikimedia.org/r/1068007 [09:45:22] (03PS1) 10Ayounsi: Add idp-test2005 to acme_chief::certificates idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1068699 (https://phabricator.wikimedia.org/T372909) [09:45:36] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068699 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [09:45:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2010.codfw.wmnet [09:46:45] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2010.codfw.wmnet [09:46:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102173 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... 
[09:47:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2010.codfw.wmnet [09:48:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2010.codfw.wmnet with OS bullseye [09:48:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [09:48:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [09:48:51] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:50:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:40] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T371742)', diff saved to https://phabricator.wikimedia.org/P68127 and previous config saved to /var/cache/conftool/dbconfig/20240829-095141-ladsgroup.json [09:51:45] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:52:32] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2010 - cgoubert@cumin1002" [09:52:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2010 - cgoubert@cumin1002" [09:52:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:52:36] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2010.codfw.wmnet 198.16.192.10.in-addr.arpa 8.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:52:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2010.codfw.wmnet 198.16.192.10.in-addr.arpa 8.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:52:40] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2010 [09:52:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2010 [09:52:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [09:53:25] (03PS1) 10Brouberol: postgresql-test: homogeneize namespace and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068703 (https://phabricator.wikimedia.org/T373503) [09:55:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:58:30] ^^ assume this is reimaging work on the k8s nodes [09:58:41] !log apply qos classifers and scedulers to interfaces on ulsfo CRs T339850 [09:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] T339850: Configure QoS marking and policy across network - 
https://phabricator.wikimedia.org/T339850 [09:59:59] topranks: yep [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1000) [10:00:12] No good way to silence this one I'm afraid [10:00:42] !log T372878 wikikube-worker2048.codfw.wmnet updated in netbox and homer running [10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [10:01:04] ah well, I was going to run homer, but akosiaris beat me to it :D [10:01:12] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2048.codfw.wmnet [10:01:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2048.codfw.wmnet [10:01:13] you'll have my node being removed as well [10:01:32] claime: niah, I just saw mine, nothing else [10:01:32] yeah it's fine... 
if the peer is being removed from the CRs you can set bgp to 'false' in netbox for the host, then run homer against the CRs to remove it - even before it's moved to the new vlan (and thus homer will know the session shouldn't be on the CR) [10:01:50] but it takes so long to run homer against those CRs these days cos of the bgp stuff :( [10:01:57] yeah :( [10:02:09] I'm pondering if we can build the list of nodes offline on a schedule and just read it on each homer invocation [10:02:14] !log homer cr*codfw* commit 'T372878' [10:02:17] (similar to what we do with capirca for the access-lists) [10:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:32] but that wouldn't help here with such operations on stuff that has only just changed [10:03:13] (03PS1) 10Hnowlan: k8s: rename mw2380 to wikikube-worker2050 [puppet] - 10https://gerrit.wikimedia.org/r/1068704 (https://phabricator.wikimedia.org/T372878) [10:03:34] yeah, honestly I just try to run homer as soon as the vlan move cookbook is done so it doesn't stay in that false alarm state too long [10:04:00] (03CR) 10Stevemunene: [C:03+1] postgresql-test: homogeneize namespace and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068703 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [10:04:20] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw2380 to wikikube-worker2050 [puppet] - 10https://gerrit.wikimedia.org/r/1068704 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:04:32] claime: that's fine as far as I'm concerned, thanks [10:04:59] but yeah, homer takes literal minutes to run on CRs [10:05:13] Is it the config dump that's taking this long? 
I haven't looked into it [10:05:24] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2048.codfw.wmnet on all recursors [10:05:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2048.codfw.wmnet on all recursors [10:05:48] RESOLVED: KubernetesCalicoDown: mw2294.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2294.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:06:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P68128 and previous config saved to /var/cache/conftool/dbconfig/20240829-100648-ladsgroup.json [10:09:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [10:09:58] claime: Homer iterates over every server across the site when run against a CR, checking which ones have the bgp attribute set to 'true' [10:10:09] to build the list of peers to add to the CR config [10:10:20] so this takes a long time... and the Netbox REST API is slow :( [10:10:55] (03CR) 10Hnowlan: [C:03+1] sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - 10https://gerrit.wikimedia.org/r/1068007 (owner: 10Clément Goubert) [10:11:02] my idea is to perhaps do that async, and save the list of peers somewhere, so when homer runs it can just pull the pre-compiled list and it's not a drag for the user [10:11:07] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 447, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:11:09] topranks: oh yeah that would be slow... 
I assumed based on past "trying to automate network gear" experience it would be the actual switch itself being slow [10:11:16] that helps with other changes, but if the peers are changing it obviously doesn't, we need to recompile the list [10:11:40] the switches aren't exactly lightning fast but the config push and application isn't too bad [10:11:50] the delay is the Homer code building the config before that [10:12:15] the switches are quicker because it just needs to check the servers _connected to that one switch_ [10:12:20] ack [10:12:30] (03CR) 10Hnowlan: [C:03+2] k8s: rename mw2380 to wikikube-worker2050 [puppet] - 10https://gerrit.wikimedia.org/r/1068704 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:12:36] ultimately when we've moved everything to L3 on the switches we won't have BGP to the CR, so we won't need to do that site-wide check [10:13:13] yeah so basically even the partial moves won't make it faster, because it checks all of the site anyways [10:13:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [10:13:48] yeah I think that's the best path, generating stuff "offline" brings its own limitations, and we have a path forward to get servers off the core routers [10:14:11] yeah.... 
*but* for you guys for anything on a new vlan you only need to run homer against the switch, not the CRs [10:14:14] say a new or moved host [10:14:18] and that's a lot quicker which is good [10:14:45] oh for sure, I've run enough the past few days to feel the difference :D [10:14:48] XioNoX: yeah there are definitely drawbacks, I'm not convinced it's the way to go either [10:16:14] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 529, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:33] there, all better until 10 minutes from now when we run another one xD [10:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:19:30] (03CR) 10Ayounsi: [C:03+2] Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 (owner: 10Ayounsi) [10:21:13] (03Merged) 10jenkins-bot: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 (owner: 10Ayounsi) [10:21:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P68130 and previous config saved to /var/cache/conftool/dbconfig/20240829-102155-ladsgroup.json [10:23:11] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:23:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:26:39] (03PS1) 10Ayounsi: Provision script: stop if no MAC provided for supermicro [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068717 [10:28:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T370903)', diff saved to 
https://phabricator.wikimedia.org/P68131 and previous config saved to /var/cache/conftool/dbconfig/20240829-102829-ladsgroup.json [10:28:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:29:35] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:30:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:34:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2010.codfw.wmnet with OS bullseye [10:34:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [10:34:26] !log homer lsw1-b6-codfw* commit 'T372878' [10:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:30] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [10:36:35] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow [10:36:45] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow (duration: 00m 10s) [10:37:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T371742)', diff saved to https://phabricator.wikimedia.org/P68132 and previous config saved to /var/cache/conftool/dbconfig/20240829-103702-ladsgroup.json [10:37:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:37:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - 
https://phabricator.wikimedia.org/T371742 [10:37:07] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2010.codfw.wmnet [10:37:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2010.codfw.wmnet [10:37:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2010.codfw.wmnet [10:37:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:37:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T371742)', diff saved to https://phabricator.wikimedia.org/P68133 and previous config saved to /var/cache/conftool/dbconfig/20240829-103724-ladsgroup.json [10:37:44] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2380 to wikikube-worker2050 [10:37:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102310 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... 
[10:38:01] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:43:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P68134 and previous config saved to /var/cache/conftool/dbconfig/20240829-104336-ladsgroup.json [10:44:53] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2380 to wikikube-worker2050 - hnowlan@cumin1002" [10:45:56] (03PS1) 10Tiziano Fogli: ripeatlas: add ping to wmf anchros check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [10:46:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2380 to wikikube-worker2050 - hnowlan@cumin1002" [10:46:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:27] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2050 [10:46:32] (03CR) 10Slyngshede: [C:03+1] Add idp-test2005 to acme_chief::certificates idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1068699 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [10:46:38] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2050 [10:46:53] (03CR) 10Ayounsi: [C:03+2] Add idp-test2005 to acme_chief::certificates idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1068699 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [10:47:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2380 to wikikube-worker2050 [10:47:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10102361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2380 to w... [10:47:57] (03PS2) 10Tiziano Fogli: ripeatlas: add ping to wmf anchros check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [10:48:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2050.codfw.wmnet with OS bullseye [10:48:17] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2050.codfw.wmnet with OS bullseye [10:48:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [10:48:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[10:48:34] (03PS3) 10Tiziano Fogli: ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [10:48:41] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2050.codfw.wmnet on all recursors [10:48:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2050.codfw.wmnet on all recursors [10:49:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2050.codfw.wmnet with OS bullseye [10:49:12] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2050.codfw.wmnet with OS bullseye [10:49:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102365 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [10:49:22] (03PS4) 10Tiziano Fogli: ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [10:49:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[10:54:34] (03PS5) 10Tiziano Fogli: ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [10:55:49] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591 (10akosiaris) 03NEW [10:56:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test2005.wikimedia.org with OS bookworm [10:56:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test2005.wikimedia.org [10:58:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P68136 and previous config saved to /var/cache/conftool/dbconfig/20240829-105844-ladsgroup.json [11:02:21] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2031.codfw.wmnet [11:02:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2031.codfw.wmnet [11:02:37] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102411 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... 
[11:03:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2031.codfw.wmnet [11:05:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:06:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2031.codfw.wmnet with OS bullseye [11:06:24] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102413 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [11:06:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [11:06:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T371742)', diff saved to https://phabricator.wikimedia.org/P68137 and previous config saved to /var/cache/conftool/dbconfig/20240829-110637-ladsgroup.json [11:06:42] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:07:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:10:59] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2031 - cgoubert@cumin1002" [11:11:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2031 - cgoubert@cumin1002" [11:11:03] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [11:11:03] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2031.codfw.wmnet 179.0.192.10.in-addr.arpa 9.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:11:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2031.codfw.wmnet 179.0.192.10.in-addr.arpa 9.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:11:07] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2031 [11:13:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2031 [11:13:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [11:13:39] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2050.codfw.wmnet with OS bullseye [11:13:49] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2050 [11:13:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2050 [11:13:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T370903)', diff saved to https://phabricator.wikimedia.org/P68138 and previous config saved to /var/cache/conftool/dbconfig/20240829-111351-ladsgroup.json [11:13:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:13:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102420 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... 
[11:13:58] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:14:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:15:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:16:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:17:07] !log homer cr*codfw* commit 'T372878' [11:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:11] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:17:21] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:21:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P68139 and previous config saved to /var/cache/conftool/dbconfig/20240829-112145-ladsgroup.json [11:22:13] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:22:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:22:30] (CR) JMeybohm: [V:+2 C:+2] Update cfssl-issuer to v0.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1068026 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [11:22:57] (CR) JMeybohm: 
[C:+2] Pin cfssl-issuer and CRDs chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [11:24:58] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:24:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:26:11] (CR) JMeybohm: sre.k8s.pool-depool-node: Check calico and fix phab (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1068007 (owner: Clément Goubert) [11:26:32] (Merged) jenkins-bot: Pin cfssl-issuer and CRDs chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [11:27:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 445, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:55] (CR) Clément Goubert: sre.k8s.pool-depool-node: Check calico and fix phab (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1068007 (owner: Clément Goubert) [11:29:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage [11:30:29] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:30:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:30:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage [11:31:02] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:31:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:31:22] !log 
arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:31:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:32:09] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:32:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:32:31] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:32:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:32:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage [11:34:52] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:34:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:35:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage [11:35:49] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [11:36:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P68140 and previous config saved to /var/cache/conftool/dbconfig/20240829-113652-ladsgroup.json [11:37:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [11:41:35] (CR) Brouberol: [C:+2] postgresql-test: homogeneize namespace and service name [deployment-charts] - 
https://gerrit.wikimedia.org/r/1068703 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol) [11:41:44] !log modify qos configuration for asw2-ulsfo xe-2/0/18 (ganeti4006) to add traffic-control-profile T339850 [11:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:48] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [11:43:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:44:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:44:43] (CR) Jaime Nuche: releases: upgrade Java JDK version from 11 to 17 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn) [11:45:55] SRE, SRE-swift-storage, Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869#10102440 (Yann) Idem here: https://commons.wikimedia.org/wiki/File:Indium.jpg The first version is not available: https://upload.wikimedia.org/... 
[11:46:25] (PS1) Brouberol: postgresql-test: deploy to postgresql-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1068742 (https://phabricator.wikimedia.org/T373503) [11:46:26] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:50] (CR) Brouberol: [C:+2] postgresql-test: deploy to postgresql-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1068742 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol) [11:48:22] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [11:51:06] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:51:29] (PS1) Brouberol: Rename file to include pgbouncer image tag value [deployment-charts] - https://gerrit.wikimedia.org/r/1068743 [11:51:37] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:51:47] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s4 to test-s4 [11:51:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2050.codfw.wmnet with OS bullseye [11:52:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T371742)', diff saved to https://phabricator.wikimedia.org/P68141 and previous config saved to /var/cache/conftool/dbconfig/20240829-115200-ladsgroup.json [11:52:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:52:05] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:52:09] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102445 (ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [11:52:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from test-s4 to test-s4 [11:52:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:52:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T371742)', diff saved to https://phabricator.wikimedia.org/P68142 and previous config saved to /var/cache/conftool/dbconfig/20240829-115222-ladsgroup.json [11:52:36] (CR) Brouberol: [C:+2] Rename file to include pgbouncer image tag value [deployment-charts] - https://gerrit.wikimedia.org/r/1068743 (owner: Brouberol) [11:55:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2031.codfw.wmnet with OS bullseye [11:55:16] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102464 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[11:56:44] !log homer lsw1-a6-codfw* commit 'T372878' [11:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:48] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:57:02] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 527, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1200) [12:00:40] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2031.codfw.wmnet [12:00:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2031.codfw.wmnet [12:00:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2031.codfw.wmnet [12:00:52] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10102470 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... 
[12:01:17] (PS12) Clément Goubert: sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - https://gerrit.wikimedia.org/r/1068007 [12:02:39] (PS1) Jelto: gerrit: increase throttling thresholds [puppet] - https://gerrit.wikimedia.org/r/1068744 (https://phabricator.wikimedia.org/T365259) [12:04:16] (CR) Clément Goubert: sre.k8s.pool-depool-node: Check calico and fix phab (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1068007 (owner: Clément Goubert) [12:08:37] (CR) AikoChou: [C:+2] ml-services: add new revertrisk isvcs for pre-save context (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: AikoChou) [12:09:51] (Merged) jenkins-bot: ml-services: add new revertrisk isvcs for pre-save context [deployment-charts] - https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: AikoChou) [12:10:23] !log homer 'lsw1-a3-codfw*' commit [12:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T371742)', diff saved to https://phabricator.wikimedia.org/P68143 and previous config saved to /var/cache/conftool/dbconfig/20240829-121444-ladsgroup.json [12:14:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:18:58] (PS1) JMeybohm: global_config: Add pki::multirootca IPs to external-services [puppet] - https://gerrit.wikimedia.org/r/1068754 (https://phabricator.wikimedia.org/T337928) [12:19:23] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1068755 [12:20:21] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s4 to test-s4 [12:21:07] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3775/co" [puppet] - https://gerrit.wikimedia.org/r/1068754 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [12:21:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s4 to test-s4 [12:22:19] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s4 to test-s4 [12:22:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from test-s4 to test-s4 [12:25:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:25:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:25:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T370903)', diff saved to https://phabricator.wikimedia.org/P68144 and previous config saved to /var/cache/conftool/dbconfig/20240829-122527-ladsgroup.json [12:25:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:29:39] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#10102529 (cmooney) [12:29:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P68145 and previous config saved to /var/cache/conftool/dbconfig/20240829-122951-ladsgroup.json [12:31:10] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:27] (PS2) JMeybohm: global_config: Add pki::multirootca IPs to external-services [puppet] - https://gerrit.wikimedia.org/r/1068754 (https://phabricator.wikimedia.org/T337928) [12:40:22] (PS2) JMeybohm: Update cfss-issuer charts to v0.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) [12:40:22] (PS1) JMeybohm: cfssl-issuer: Add external-services support [deployment-charts] - https://gerrit.wikimedia.org/r/1068768 (https://phabricator.wikimedia.org/T359423) [12:44:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P68146 and previous config saved to /var/cache/conftool/dbconfig/20240829-124459-ladsgroup.json [12:47:52] (PS1) Brouberol: cloudnative-pg: enable pooler->PG and PG->PG traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1068769 (https://phabricator.wikimedia.org/T373503) [12:48:27] (CR) Elukey: [C:+1] data.yaml Update email address. [puppet] - https://gerrit.wikimedia.org/r/1068673 (owner: Slyngshede) [12:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:51:41] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow [12:51:51] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@cb0bc4d]: Test Refine through Airflow (duration: 00m 09s) [12:54:34] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10102581 (ssingh) >>! In T373518#10101487, @KFrancis wrote: > Hi all, I'm confirming the NDA is signed. 
Please proceed with next steps. Thanks as always @KFrancis! @j... [12:58:38] (PS2) Brouberol: cloudnative-pg: enable pooler->PG and PG->PG traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1068769 (https://phabricator.wikimedia.org/T373503) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1300). [13:00:05] joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T371742)', diff saved to https://phabricator.wikimedia.org/P68147 and previous config saved to /var/cache/conftool/dbconfig/20240829-130006-ladsgroup.json [13:00:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:00:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:00:24] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:00:27] * TheresNoTime can deploy! [13:00:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T371742)', diff saved to https://phabricator.wikimedia.org/P68148 and previous config saved to /var/cache/conftool/dbconfig/20240829-130029-ladsgroup.json [13:00:41] Hi, I'm around! 
[13:01:16] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068684 (https://phabricator.wikimedia.org/T66315) (owner: Joely Rooke WMDE) [13:02:07] (Merged) jenkins-bot: Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/1068684 (https://phabricator.wikimedia.org/T66315) (owner: Joely Rooke WMDE) [13:02:36] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1068684|Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]] [13:03:21] (CR) Samtar: "`.20` is on prod, ready for review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527) (owner: Samtar) [13:03:36] (PS5) Samtar: Add CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527) [13:05:49] SRE, Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10102617 (elukey) [13:05:53] SRE, Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10102610 (elukey) Thanks a lot for the task! Something similar happened recently for puppetserver1001 as well: https://grafana.wikimedia.org/goto/J6dYu13SR?orgI... 
[13:06:35] !log samtar@deploy1003 joelyrookewmde, samtar: Backport for [[gerrit:1068684|Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:40] joelyrookewmde: ready for testing on mwdebug :) [13:06:40] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:06] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#10102636 (cmooney) [13:07:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:17] works, thank you! [13:08:22] !log samtar@deploy1003 joelyrookewmde, samtar: Continuing with sync [13:09:32] (PS14) Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers [cookbooks] - https://gerrit.wikimedia.org/r/1067989 [13:12:33] SRE, Infrastructure-Foundations, netops: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10102654 (cmooney) [13:13:05] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068684|Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]] (duration: 10m 28s) [13:13:24] joelyrookewmde: live on prod [13:13:55] amazing thank you! 
[13:14:10] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:13] SRE, Infrastructure-Foundations, netops: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10102648 (cmooney) [13:15:31] (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1068754 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [13:15:52] (PS1) Elukey: profile::prometheus::ops: collect Puppetserver's metrics [puppet] - https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) [13:16:17] (CR) Stevemunene: [C:+1] cloudnative-pg: enable pooler->PG and PG->PG traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1068769 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol) [13:16:33] (CR) Brouberol: [C:+2] cloudnative-pg: enable pooler->PG and PG->PG traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1068769 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol) [13:18:43] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3778/co" [puppet] - https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: Elukey) [13:19:07] (CR) Elukey: profile::prometheus::ops: collect Puppetserver's metrics [puppet] - https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: Elukey) [13:19:41] SRE, Infrastructure-Foundations, Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10102686 (elukey) p:Triage→High [13:22:55] (CR) CDobbins: 
prometheus: add script to check TCP MSS clamping value (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins) [13:24:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T371742)', diff saved to https://phabricator.wikimedia.org/P68149 and previous config saved to /var/cache/conftool/dbconfig/20240829-132416-ladsgroup.json [13:24:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:24:27] (CR) Elukey: [C:+1] Provision script: stop if no MAC provided for supermicro [software/netbox-extras] - https://gerrit.wikimedia.org/r/1068717 (owner: Ayounsi) [13:24:37] (CR) Elukey: [C:+2] role::deployment_server::kubernetes: upgrade nutcracker [puppet] - https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) (owner: Elukey) [13:25:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T370903)', diff saved to https://phabricator.wikimedia.org/P68150 and previous config saved to /var/cache/conftool/dbconfig/20240829-132523-ladsgroup.json [13:25:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:25:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [13:25:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [13:25:46] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[13:25:56] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102700 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [13:26:10] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:13] (CR) Elukey: [C:+1] Provision script: Assign the mgmt IP as oob_ip [software/netbox-extras] - https://gerrit.wikimedia.org/r/1068008 (owner: Ayounsi) [13:29:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [13:30:07] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102702 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[13:31:12] (PS1) Kamila Součková: kubernetes: Rename mw2401 to wikikube-worker2051 [puppet] - https://gerrit.wikimedia.org/r/1068776 (https://phabricator.wikimedia.org/T372878) [13:34:48] FIRING: KubernetesCalicoDown: wikikube-worker2050.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2050.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:36:22] (PS4) Kgraessle: Enable AutoModerator on id.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) [13:37:16] (PS1) Marostegui: installserver: Wipe db2230 [puppet] - https://gerrit.wikimedia.org/r/1068778 [13:37:20] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102716 (Jclark-ctr) a:klausman→Jclark-ctr [13:39:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P68151 and previous config saved to /var/cache/conftool/dbconfig/20240829-133923-ladsgroup.json [13:40:04] (CR) Marostegui: [C:+2] installserver: Wipe db2230 [puppet] - https://gerrit.wikimedia.org/r/1068778 (owner: Marostegui) [13:40:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P68152 and previous config saved to /var/cache/conftool/dbconfig/20240829-134030-ladsgroup.json [13:42:16] (CR) Btullis: [C:+1] "Looks good to me." 
[puppet] - https://gerrit.wikimedia.org/r/1068656 (https://phabricator.wikimedia.org/T372432) (owner: Klausman) [13:42:38] (PS1) Marostegui: installserver: Do not wipe db2232 and db2233 [puppet] - https://gerrit.wikimedia.org/r/1068782 [13:43:23] (CR) Clément Goubert: [C:+1] kubernetes: Rename mw2401 to wikikube-worker2051 [puppet] - https://gerrit.wikimedia.org/r/1068776 (https://phabricator.wikimedia.org/T372878) (owner: Kamila Součková) [13:43:55] (CR) Klausman: [V:+2 C:+2] preseed: Add ml-lab machines and dse-k8s-worker1009 [puppet] - https://gerrit.wikimedia.org/r/1068656 (https://phabricator.wikimedia.org/T372432) (owner: Klausman) [13:44:46] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:44:47] (CR) Ssingh: "Forgive my ignorance: but where and how is the 8141 port specified?" [puppet] - https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: Elukey) [13:45:39] (CR) Marostegui: [C:+2] installserver: Do not wipe db2232 and db2233 [puppet] - https://gerrit.wikimedia.org/r/1068782 (owner: Marostegui) [13:49:14] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt logging-sd1 - jclark@cumin1002" [13:49:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt logging-sd1 - jclark@cumin1002" [13:49:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-sd1001.mgmt.eqiad.wmnet with reboot policy FORCED [13:50:58] !log add qos interface schedulers on lsw1-d4-codfw T339850 [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] T339850: Configure QoS marking and policy across network - 
https://phabricator.wikimedia.org/T339850 [13:51:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-sd1002.mgmt.eqiad.wmnet with reboot policy FORCED [13:51:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-sd1003.mgmt.eqiad.wmnet with reboot policy FORCED [13:51:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-sd1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:52:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1002.eqiad.wmnet with OS bookworm [13:52:19] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102797 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [13:52:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [13:53:20] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102799 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[13:54:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P68153 and previous config saved to /var/cache/conftool/dbconfig/20240829-135430-ladsgroup.json [13:54:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [13:54:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [13:54:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [13:55:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10102805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[13:55:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P68154 and previous config saved to /var/cache/conftool/dbconfig/20240829-135537-ladsgroup.json [13:55:44] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:57:42] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:59:14] (03PS1) 10Slyngshede: R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [13:59:48] RESOLVED: KubernetesCalicoDown: wikikube-worker2050.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2050.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:59:57] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2050.codfw.wmnet [13:59:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2050.codfw.wmnet [14:03:13] 06SRE, 06Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492#10102834 (10ssingh) @ayounsi, @cmooney: Is it fine to remove testlb (and testlb6)? Is there any continued use for them? 
[14:03:41] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10102839 (10hnowlan) [14:05:38] (03CR) 10Vgutierrez: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [14:05:58] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:06:10] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10102850 (10cmooney) 05Resolved→03Open Hey @Jhancock.wm as discussed on irc I was hoping to get the main uplink for this server moved from... 
[14:09:34] (03CR) 10Elukey: "It is a very good question - in theory, IIUC, prometheus::jmx_exporter_config gathers the JMX port set in puppet and uses it in the config" [puppet] - 10https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [14:09:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T371742)', diff saved to https://phabricator.wikimedia.org/P68155 and previous config saved to /var/cache/conftool/dbconfig/20240829-140937-ladsgroup.json [14:09:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [14:09:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [14:10:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68156 and previous config saved to /var/cache/conftool/dbconfig/20240829-140959-ladsgroup.json [14:10:05] (03PS2) 10Slyngshede: R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [14:10:14] (03CR) 10Hnowlan: [C:03+1] kubernetes: Rename mw2401 to wikikube-worker2051 [puppet] - 10https://gerrit.wikimedia.org/r/1068776 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [14:10:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T370903)', diff saved to https://phabricator.wikimedia.org/P68157 and previous config saved to /var/cache/conftool/dbconfig/20240829-141045-ladsgroup.json [14:10:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:11:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:11:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T370903)', diff saved 
to https://phabricator.wikimedia.org/P68158 and previous config saved to /var/cache/conftool/dbconfig/20240829-141107-ladsgroup.json [14:11:21] (03CR) 10Elukey: "This is in the change catalog:" [puppet] - 10https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [14:14:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1001.mgmt.eqiad.wmnet with reboot policy FORCED [14:14:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:14:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:15:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:16:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-sd1001.eqiad.wmnet with OS bookworm [14:16:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-sd1002.eqiad.wmnet with OS bookworm [14:16:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-sd1003.eqiad.wmnet with OS bookworm [14:16:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-sd1004.eqiad.wmnet with OS bookworm [14:16:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10102914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-sd1001.eqiad.wmnet with OS bookworm [14:16:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10102915 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-sd1002.eqiad.wmnet with 
OS bookworm [14:16:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10102916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-sd1003.eqiad.wmnet with OS bookworm [14:16:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10102917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-sd1004.eqiad.wmnet with OS bookworm [14:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:24:39] (03CR) 10Brian Wolff: varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [14:24:54] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10102968 (10elukey) Something very strange: ` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 200986 puppet... [14:25:49] !log jgiannelos@deploy1003 Started deploy [restbase/deploy@5a4727a]: (no justification provided) [14:26:17] (03CR) 10Ssingh: [C:03+1] "Thanks for sharing and clarifying!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [14:27:25] (03PS14) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [14:32:21] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1001.eqiad.wmnet with reason: host reimage [14:32:31] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1002.eqiad.wmnet with reason: host reimage [14:32:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1003.eqiad.wmnet with reason: host reimage [14:32:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1004.eqiad.wmnet with reason: host reimage [14:34:52] (03PS1) 10Jgiannelos: mobileapps: Enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 [14:35:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68159 and previous config saved to /var/cache/conftool/dbconfig/20240829-143514-ladsgroup.json [14:35:19] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:35:36] (03PS2) 10Jgiannelos: mobileapps: Re-enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 (https://phabricator.wikimedia.org/T319365) [14:35:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [14:36:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1001.eqiad.wmnet with reason: host reimage [14:36:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: 
Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [14:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1004.eqiad.wmnet with reason: host reimage [14:38:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1002.eqiad.wmnet with OS bookworm [14:39:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103064 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [14:40:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [14:41:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103085 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... 
[14:41:42] (03CR) 10Elukey: [C:03+2] profile::prometheus::ops: collect Puppetserver's metrics [puppet] - 10https://gerrit.wikimedia.org/r/1068773 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [14:42:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [14:42:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1002.eqiad.wmnet with reason: host reimage [14:42:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [14:42:24] !log jgiannelos@deploy1003 Finished deploy [restbase/deploy@5a4727a]: (no justification provided) (duration: 16m 35s) [14:46:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1003.eqiad.wmnet with reason: host reimage [14:46:22] (03PS2) 10Dzahn: releases: upgrade Java JDK version from 11 to 17 [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) [14:47:41] (03CR) 10Dzahn: releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:49:50] (03CR) 10Dzahn: [C:03+2] releases: upgrade Java JDK version from 11 to 17 [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:50:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P68160 and previous config saved to /var/cache/conftool/dbconfig/20240829-145021-ladsgroup.json [14:52:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:52:58] (03CR) 10Dzahn: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:53:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:53:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1001.eqiad.wmnet with OS bookworm [14:53:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10103122 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-sd1001.eqiad.wmnet with OS bookworm completed: - logging-sd10... [14:53:29] (03CR) 10Brouberol: [C:03+1] "I'll test whether things work as expected. Thanks Antoine!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) (owner: 10Hashar) [14:53:33] (03CR) 10Brouberol: [C:03+2] archiva: allow trailing slash for top directories [puppet] - 10https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) (owner: 10Hashar) [14:55:21] !log downtiming lvs4010 to test T358260 [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:25] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260 [14:55:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2014.codfw.wmnet with reason: testing T358260 [14:55:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:56:00] woah woah [14:56:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2014.codfw.wmnet with reason: testing T358260 [14:56:03] ha [14:56:10] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2014.codfw.wmnet [14:56:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2014.codfw.wmnet [14:56:11] (03CR) 10Dzahn: [C:03+2] "if we stop passing the java_home variable to modules/jenkins/init.pp, the default will be a fallback to /usr/lib/jvm/java-8-openjdk-amd64/" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:56:30] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: testing T358260 [14:56:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:56:37] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host logging-sd1004.eqiad.wmnet with OS bookworm [14:56:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10103134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-sd1004.eqiad.wmnet with OS bookworm completed: - logging-sd10... [14:56:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: testing T358260 [14:58:57] (03CR) 10Dzahn: [C:03+2] "so basically we could have just merged this change as-is. if the java_home setting is irrelevant anyways.. all of the cleanup could have c" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:58:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [14:59:18] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:59:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:59:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1002.eqiad.wmnet with OS bookworm [14:59:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10103140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-sd1002.eqiad.wmnet with OS bookworm completed: - logging-sd10... 
[15:00:05] hashar and andre: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1500) [15:00:41] (03CR) 10Sérgio Lopes: [C:03+1] mobileapps: Re-enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:00:56] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Re-enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:01:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:38] (03CR) 10MSantos: [C:03+1] mobileapps: Re-enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:01:55] (03PS1) 10Dzahn: releases: drop java_home parameter for releases/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1068801 (https://phabricator.wikimedia.org/T359795) [15:02:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [15:03:34] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1068801" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [15:03:40] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:45] !log jclark@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1003.eqiad.wmnet with OS bookworm [15:04:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10103155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-sd1003.eqiad.wmnet with OS bookworm completed: - logging-sd10... [15:04:33] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [15:04:37] !log releases* - temp disable puppet, maintenance for java version upgrade [15:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:43] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:05:25] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P68161 and previous config saved to 
/var/cache/conftool/dbconfig/20240829-150529-ladsgroup.json [15:05:35] ^ expected BGP alerts [15:06:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2295.codfw.wmnet [15:07:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [15:07:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2295.codfw.wmnet [15:07:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2296.codfw.wmnet [15:07:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [15:07:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10103161 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:07:27] PROBLEM - Host lvs4010 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:29] RECOVERY - Host lvs4010 is UP: PING OK - Packet loss = 0%, RTA = 71.06 ms [15:07:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[15:07:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2296.codfw.wmnet [15:07:54] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2297.codfw.wmnet [15:08:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2297.codfw.wmnet [15:08:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68162 and previous config saved to /var/cache/conftool/dbconfig/20240829-150846-ladsgroup.json [15:08:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:08:55] (03Merged) 10jenkins-bot: mobileapps: Re-enable caching after RESTBase sets the UA header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068798 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:09:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:09:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:09:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:09:39] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:09:40] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:09:42] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:09:45] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply 
[15:09:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:10:00] (03CR) 10Jsn.sherman: "Thanks for the followup: This kind of makes it look like the whole array is for idwiki, I suggest setting a task comment per wiki config l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [15:10:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68163 and previous config saved to /var/cache/conftool/dbconfig/20240829-151000-ladsgroup.json [15:10:25] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:10:29] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:10:42] ^ expected [15:10:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:11:10] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:25] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:12:29] PROBLEM - PyBal connections to etcd on lvs4010 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [15:16:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:16:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:16:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-lab1002.eqiad.wmnet with OS bookworm [15:16:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [15:19:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68164 and previous config saved to /var/cache/conftool/dbconfig/20240829-151903-ladsgroup.json [15:19:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:20:27] The postorius service for mailman3 seems to be out to lunch. https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/ just keeps spinning for me. 
[15:20:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T371742)', diff saved to https://phabricator.wikimedia.org/P68165 and previous config saved to /var/cache/conftool/dbconfig/20240829-152036-ladsgroup.json [15:20:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [15:20:43] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:20:51] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [15:20:59] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68166 and previous config saved to /var/cache/conftool/dbconfig/20240829-152058-ladsgroup.json [15:21:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:49] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:44] (03PS5) 10Kgraessle: Enable AutoModerator on id.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) [15:23:02] (03CR) 10Kgraessle: Enable AutoModerator on id.wiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [15:24:20] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P68167 and previous config saved to /var/cache/conftool/dbconfig/20240829-152419-ladsgroup.json [15:24:39] (03CR) 10Jsn.sherman: [C:03+1] "looks good; thanks for taking care of this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [15:26:30] (03CR) 10Ayounsi: [C:03+2] Provision script: stop if no MAC provided for supermicro [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068717 (owner: 10Ayounsi) [15:26:34] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10103267 (10elukey) Created https://grafana-rw.wikimedia.org/d/e0f6afe3-2aea-483d-9f5e-55f0cba9207f/puppetserver, didn't add all the metrics... [15:26:38] (03PS31) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [15:27:54] 06SRE, 06Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492#10103272 (10ayounsi) @ssingh yep, you can clean it up anytime, thanks ! 
[15:28:33] (03Merged) 10jenkins-bot: Provision script: stop if no MAC provided for supermicro [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068717 (owner: 10Ayounsi) [15:29:17] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:29:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:30:07] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:30:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:30:41] (03CR) 10Ayounsi: [C:03+2] Provision script: Assign the mgmt IP as oob_ip [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068008 (owner: 10Ayounsi) [15:31:10] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:32:36] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:32:46] (03Merged) 10jenkins-bot: Provision script: Assign the mgmt IP as oob_ip [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068008 (owner: 10Ayounsi) [15:33:05] (03PS1) 10Cathal Mooney: Apply different qos sceduler config for EX4600 platform [homer/public] - 10https://gerrit.wikimedia.org/r/1068807 (https://phabricator.wikimedia.org/T373594) [15:33:11] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:33:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [15:33:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:33:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [15:34:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P68168 and previous config saved to /var/cache/conftool/dbconfig/20240829-153410-ladsgroup.json [15:35:09] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:35:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:39:17] !log re-enable puppet on lvs4010 [15:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P68169 and previous config saved to /var/cache/conftool/dbconfig/20240829-153925-ladsgroup.json [15:40:33] RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:40:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68170 and previous config saved to /var/cache/conftool/dbconfig/20240829-154040-ladsgroup.json [15:40:45] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [15:40:46] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:42:09] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs4010.ulsfo.wmnet [15:42:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4010.ulsfo.wmnet [15:42:27] RECOVERY - PyBal connections to etcd on lvs4010 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [15:46:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:47:35] (03CR) 10Hashar: "Thank you! https://archiva.wikimedia.org/repository/releases pass through and is then redirected to the version with a trailing slash \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) (owner: 10Hashar) [15:48:02] (03CR) 10Ayounsi: [C:03+1] Apply different qos sceduler config for EX4600 platform [homer/public] - 10https://gerrit.wikimedia.org/r/1068807 (https://phabricator.wikimedia.org/T373594) (owner: 10Cathal Mooney) [15:49:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P68171 and previous config saved to /var/cache/conftool/dbconfig/20240829-154917-ladsgroup.json [15:50:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:54:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Maint over', diff 
saved to https://phabricator.wikimedia.org/P68172 and previous config saved to /var/cache/conftool/dbconfig/20240829-155431-ladsgroup.json [15:55:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P68173 and previous config saved to /var/cache/conftool/dbconfig/20240829-155547-ladsgroup.json [15:59:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:04:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [16:04:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10103469 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... 
[16:04:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68174 and previous config saved to /var/cache/conftool/dbconfig/20240829-160425-ladsgroup.json [16:04:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:04:30] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:04:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:04:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T370903)', diff saved to https://phabricator.wikimedia.org/P68175 and previous config saved to /var/cache/conftool/dbconfig/20240829-160447-ladsgroup.json [16:05:05] (03PS1) 10Elukey: blubber: update the buildkit version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1068815 [16:05:35] (03CR) 10Hnowlan: [C:03+1] blubber: update the buildkit version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1068815 (owner: 10Elukey) [16:06:10] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T370903)', diff saved to https://phabricator.wikimedia.org/P68176 and previous config saved to /var/cache/conftool/dbconfig/20240829-160757-ladsgroup.json [16:08:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[16:08:59] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... [16:08:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:10:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P68177 and previous config saved to /var/cache/conftool/dbconfig/20240829-161054-ladsgroup.json [16:16:50] (03CR) 10Elukey: [C:03+2] blubber: update the buildkit version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1068815 (owner: 10Elukey) [16:19:57] (03PS2) 10Cathal Mooney: Apply different qos sceduler config for EX4600 platform [homer/public] - 10https://gerrit.wikimedia.org/r/1068807 (https://phabricator.wikimedia.org/T373594) [16:20:35] (03Merged) 10jenkins-bot: blubber: update the buildkit version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1068815 (owner: 10Elukey) [16:21:45] (03CR) 10Cathal Mooney: [C:03+2] Apply different qos sceduler config for EX4600 platform [homer/public] - 10https://gerrit.wikimedia.org/r/1068807 (https://phabricator.wikimedia.org/T373594) (owner: 10Cathal Mooney) [16:22:24] (03Merged) 10jenkins-bot: Apply different qos sceduler config for EX4600 platform [homer/public] - 10https://gerrit.wikimedia.org/r/1068807 (https://phabricator.wikimedia.org/T373594) (owner: 10Cathal Mooney) [16:23:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P68178 and previous config saved to /var/cache/conftool/dbconfig/20240829-162304-ladsgroup.json [16:23:56] (03PS1) 10Ladsgroup: 
mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [16:26:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68179 and previous config saved to /var/cache/conftool/dbconfig/20240829-162601-ladsgroup.json [16:26:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [16:26:06] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:26:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [16:27:11] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [16:27:15] !log update qos configuration for asw2-ulsfo to use traffic-control profile T373594 [16:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:19] T373594: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594 [16:28:40] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:19] (03PS2) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [16:34:44] (03PS1) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [16:34:58] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: 
https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10103590 (10bd808) >>! In T353891#9429101, @Ladsgroup wrote: > I see the problem in two areas only: > - Opening the main page > - Opening the pa... [16:35:16] 06SRE, 06Data-Engineering, 06Data-Platform: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10103600 (10andrea.denisse) [16:38:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P68180 and previous config saved to /var/cache/conftool/dbconfig/20240829-163811-ladsgroup.json [16:39:39] (03PS1) 10Elukey: services: update Thumbor's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068819 (https://phabricator.wikimedia.org/T373618) [16:40:39] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10103630 (10andrea.denisse) [16:40:43] 06SRE, 06Data-Engineering, 06Data-Platform: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10103631 (10andrea.denisse) [16:45:08] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10103639 (10andrea.denisse) p:05Triage→03High Hi team, please take a look at both the `aqs1013`, and the `aqs1014` hosts, the degraded raid alert is firing since April 18 creating unnecessar... 
[16:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T367856)', diff saved to https://phabricator.wikimedia.org/P68182 and previous config saved to /var/cache/conftool/dbconfig/20240829-164717-marostegui.json [16:47:22] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [16:47:41] 06SRE, 06Data-Engineering, 06Data-Platform: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10103656 (10andrea.denisse) p:05Triage→03High [16:48:19] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10103651 (10andrea.denisse) Hi team, please take a look at both the `aqs1013`, and the `aqs1014` hosts, the degraded raid alert is firing since April 18 creating unnecessary noise for SREs inbox... [16:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:49:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:50:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:50:14] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10103675 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced power [16:53:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T370903)', diff saved to https://phabricator.wikimedia.org/P68183 and previous config saved to /var/cache/conftool/dbconfig/20240829-165319-ladsgroup.json [16:53:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with 
reason: Maintenance [16:53:24] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:53:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:53:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T370903)', diff saved to https://phabricator.wikimedia.org/P68184 and previous config saved to /var/cache/conftool/dbconfig/20240829-165341-ladsgroup.json [16:54:19] (03CR) 10Hnowlan: [C:03+1] services: update Thumbor's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068819 (https://phabricator.wikimedia.org/T373618) (owner: 10Elukey) [16:54:22] (03PS3) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [16:54:22] (03PS2) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [16:56:06] jouncebot: nowandnext [16:56:13] oh. [16:56:54] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views for eo/sv/fi wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) [16:57:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) (owner: 10C. 
Scott Ananian) [16:57:52] (03PS1) 10Andrew Bogott: codfw1dev: fix ldap hostname (and cert) for horizon access [puppet] - 10https://gerrit.wikimedia.org/r/1068823 [16:58:11] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [16:58:42] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: fix ldap hostname (and cert) for horizon access [puppet] - 10https://gerrit.wikimedia.org/r/1068823 (owner: 10Andrew Bogott) [16:59:03] (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read Views for eo/sv/fi wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) (owner: 10C. Scott Ananian) [16:59:25] (03PS1) 10Alexandros Kosiaris: Rename mw229[567] to wikikube-worker205[123] [puppet] - 10https://gerrit.wikimedia.org/r/1068824 (https://phabricator.wikimedia.org/T372878) [17:00:38] (03PS4) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [17:00:38] (03PS3) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [17:01:48] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T373235#10103728 (10VRiley-WMF) a:03VRiley-WMF [17:02:10] (03CR) 10CI reject: [V:04-1] Rename mw229[567] to wikikube-worker205[123] [puppet] - 10https://gerrit.wikimedia.org/r/1068824 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [17:02:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P68185 and previous config saved to /var/cache/conftool/dbconfig/20240829-170224-marostegui.json [17:02:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T370903)', diff saved to 
https://phabricator.wikimedia.org/P68186 and previous config saved to /var/cache/conftool/dbconfig/20240829-170252-ladsgroup.json [17:02:57] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:04:32] (03CR) 10CI reject: [V:04-1] [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 (owner: 10Ladsgroup) [17:04:35] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [17:04:53] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: Rename mw2401 to wikikube-worker2051 [puppet] - 10https://gerrit.wikimedia.org/r/1068776 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [17:05:52] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [17:08:06] (03PS5) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [17:08:06] (03PS4) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [17:12:00] (03CR) 10CI reject: [V:04-1] [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 (owner: 10Ladsgroup) [17:12:05] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [17:13:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:13:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:14:10] FIRING: SystemdUnitFailed: 
wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:20] (03CR) 10Kamila Součková: [C:03+2] kubernetes: Rename mw2401 to wikikube-worker2051 [puppet] - 10https://gerrit.wikimedia.org/r/1068776 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [17:16:41] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2401.codfw.wmnet [17:17:15] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2401.codfw.wmnet [17:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P68187 and previous config saved to /var/cache/conftool/dbconfig/20240829-171733-marostegui.json [17:18:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P68188 and previous config saved to /var/cache/conftool/dbconfig/20240829-171759-ladsgroup.json [17:18:49] jouncebot: now [17:18:49] For the next 0 hour(s) and 41 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1700) [17:18:49] For the next 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T1700) [17:19:32] I feel like this was the 2nd or 3rd week in a row that jouncebot fell off irc right as "my" window should have been announced. I probably should look into that... 
[17:19:59] * bd808 has nothing to deploy but noticed he didn't have a ping to remind him to check [17:21:11] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2401 to wikikube-worker2051 [17:21:27] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:22:27] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:23:09] TheresNoTime: Thanks for restarting jouncebot. I just saw that you did that. [17:25:48] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2401 to wikikube-worker2051 - kamila@cumin1002" [17:26:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2401 to wikikube-worker2051 - kamila@cumin1002" [17:26:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:26:08] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2051 [17:26:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2051 [17:27:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2401 to wikikube-worker2051 [17:27:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10103822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2401 to wi... [17:27:47] (03CR) 10Andrea Denisse: "Hi Filippo, thanks for taking a look. I'm changing the active hosts in this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [17:27:49] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2051.codfw.wmnet with OS bullseye [17:27:59] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2051 [17:28:05] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10103824 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [17:28:10] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:31:17] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2051 - kamila@cumin1002" [17:31:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2051 - kamila@cumin1002" [17:31:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:31:22] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2051.codfw.wmnet 65.0.192.10.in-addr.arpa 5.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:31:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2051.codfw.wmnet 65.0.192.10.in-addr.arpa 5.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:31:26] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2051 [17:32:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2051 [17:32:23] !log kamila@cumin1002 
END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2051 [17:32:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T367856)', diff saved to https://phabricator.wikimedia.org/P68189 and previous config saved to /var/cache/conftool/dbconfig/20240829-173240-marostegui.json [17:32:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2167.codfw.wmnet with reason: Maintenance [17:32:45] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [17:32:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2167.codfw.wmnet with reason: Maintenance [17:33:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T367856)', diff saved to https://phabricator.wikimedia.org/P68190 and previous config saved to /var/cache/conftool/dbconfig/20240829-173303-marostegui.json [17:33:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P68191 and previous config saved to /var/cache/conftool/dbconfig/20240829-173313-ladsgroup.json [17:33:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:34:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:34:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:34:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:34:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T371742)', diff saved to https://phabricator.wikimedia.org/P68192 and previous config saved to 
/var/cache/conftool/dbconfig/20240829-173416-ladsgroup.json [17:34:20] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:35:08] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:35:32] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:38:43] (03PS8) 10Srishakatux: Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [17:39:27] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:40:36] (03CR) 10Srishakatux: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [17:40:59] (03CR) 10Amire80: [C:03+1] Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [17:46:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10103870 (10RobH) p:05Medium→03High #netops is working on scheduling a turn up call, but won't be able to do so until... 
[17:46:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10103877 (10RobH) [17:48:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T370903)', diff saved to https://phabricator.wikimedia.org/P68193 and previous config saved to /var/cache/conftool/dbconfig/20240829-174820-ladsgroup.json [17:48:22] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2051.codfw.wmnet with reason: host reimage [17:48:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [17:48:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:48:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [17:48:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68194 and previous config saved to /var/cache/conftool/dbconfig/20240829-174842-ladsgroup.json [17:50:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68195 and previous config saved to /var/cache/conftool/dbconfig/20240829-175053-ladsgroup.json [17:51:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2051.codfw.wmnet with reason: host reimage [17:52:32] (03CR) 10Dzahn: [V:03+1 C:03+2] "compiler output is one of the "fixes currently broken run" that may look like no change. actually no change on contint." 
[puppet] - 10https://gerrit.wikimedia.org/r/1068801 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:55:19] (03CR) 10Dzahn: [V:03+1 C:03+2] "on releases2003:" [puppet] - 10https://gerrit.wikimedia.org/r/1068801 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:56:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T371742)', diff saved to https://phabricator.wikimedia.org/P68196 and previous config saved to /var/cache/conftool/dbconfig/20240829-175658-ladsgroup.json [17:57:03] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:57:10] (03CR) 10Dzahn: [C:03+2] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:00:30] (03PS1) 10Alexandros Kosiaris: Rename mw229[567] to wikikube-worker205[234] [puppet] - 10https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) [18:03:16] (03CR) 10CI reject: [V:04-1] Rename mw229[567] to wikikube-worker205[234] [puppet] - 10https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [18:03:59] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://phabricator.wikimedia.org/T359795#10103932" [puppet] - 10https://gerrit.wikimedia.org/r/1068801 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:04:25] (03CR) 10Dzahn: [C:03+2] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:05:47] (03CR) 10Alexandros Kosiaris: "18:01:26 1) profile::acme_chief on debian-11-x86_64 is expected to compile into a catalogue without dependency cycles" [puppet] - 10https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [18:06:02] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P68197 and previous config saved to /var/cache/conftool/dbconfig/20240829-180601-ladsgroup.json [18:06:42] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [18:12:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P68198 and previous config saved to /var/cache/conftool/dbconfig/20240829-181205-ladsgroup.json [18:12:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2051.codfw.wmnet with OS bullseye [18:12:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10103958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... 
[18:15:55] !log running homer after wikikube-worker2051 rename [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:19] (03CR) 10Andrew Bogott: [C:03+2] Put cloudcephosd1036 into service [puppet] - 10https://gerrit.wikimedia.org/r/1063861 (https://phabricator.wikimedia.org/T363344) (owner: 10Andrew Bogott) [18:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:21:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P68199 and previous config saved to /var/cache/conftool/dbconfig/20240829-182108-ladsgroup.json [18:23:33] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@abb06c4]: Deploy latest Analitycs Airflow DAGs to pickup T373402 [18:23:37] T373402: Update Commons Impact Metrics allow-list August 2024 - https://phabricator.wikimedia.org/T373402 [18:24:15] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@abb06c4]: Deploy latest Analitycs Airflow DAGs to pickup T373402 (duration: 00m 42s) [18:27:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P68200 and previous config saved to /var/cache/conftool/dbconfig/20240829-182713-ladsgroup.json [18:28:13] (03PS1) 10Andrew Bogott: Remove SNIs for internal hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1068842 [18:28:46] (03CR) 10Andrew Bogott: [C:03+2] Remove SNIs for internal hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1068842 (owner: 10Andrew Bogott) [18:29:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10104012 (10Jhancock.wm) @cmooney I 
got everything swapped around as requested and the new 1G link is on ge-0/0/43. I can confirm that it will... [18:36:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68201 and previous config saved to /var/cache/conftool/dbconfig/20240829-183616-ladsgroup.json [18:36:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:36:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:36:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:36:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T370903)', diff saved to https://phabricator.wikimedia.org/P68202 and previous config saved to /var/cache/conftool/dbconfig/20240829-183638-ladsgroup.json [18:38:15] (03PS1) 10Andrew Bogott: labtesthorizon: reuse ns0.openstack.codfw1dev hostnames for ldap [puppet] - 10https://gerrit.wikimedia.org/r/1068843 [18:38:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T370903)', diff saved to https://phabricator.wikimedia.org/P68203 and previous config saved to /var/cache/conftool/dbconfig/20240829-183848-ladsgroup.json [18:40:49] (03CR) 10Andrew Bogott: [C:03+2] labtesthorizon: reuse ns0.openstack.codfw1dev hostnames for ldap [puppet] - 10https://gerrit.wikimedia.org/r/1068843 (owner: 10Andrew Bogott) [18:42:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T371742)', diff saved to https://phabricator.wikimedia.org/P68204 and previous config saved to /var/cache/conftool/dbconfig/20240829-184220-ladsgroup.json [18:42:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: 
Maintenance [18:42:25] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:42:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:42:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T371742)', diff saved to https://phabricator.wikimedia.org/P68205 and previous config saved to /var/cache/conftool/dbconfig/20240829-184242-ladsgroup.json [18:45:31] (03CR) 10Ottomata: "I see, no if the app responds it should be ready to go!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:46:25] (03CR) 10Ottomata: [C:03+1] "+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:51:39] (03PS2) 10Andrew Bogott: Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) [18:52:52] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2051.codfw.wmnet [18:52:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2051.codfw.wmnet [18:53:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P68206 and previous config saved to /var/cache/conftool/dbconfig/20240829-185355-ladsgroup.json [18:55:08] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10104094 (10kamila) [19:02:50] (03CR) 10Dzahn: [C:03+2] "yep. 
after looking at our dashboard panel, this makes a lot of sense" [puppet] - 10https://gerrit.wikimedia.org/r/1068744 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:06:11] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host sretest2002.codfw.wmnet [19:09:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P68207 and previous config saved to /var/cache/conftool/dbconfig/20240829-190902-ladsgroup.json [19:10:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T371742)', diff saved to https://phabricator.wikimedia.org/P68208 and previous config saved to /var/cache/conftool/dbconfig/20240829-191026-ladsgroup.json [19:10:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:24:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T370903)', diff saved to https://phabricator.wikimedia.org/P68209 and previous config saved to /var/cache/conftool/dbconfig/20240829-192409-ladsgroup.json [19:24:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:24:14] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:24:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:25:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P68210 and previous config saved to /var/cache/conftool/dbconfig/20240829-192533-ladsgroup.json [19:28:51] (03PS3) 10Andrea Denisse: alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1064826 
(https://phabricator.wikimedia.org/T372418) [19:29:04] (03PS4) 10Ssingh: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [19:32:46] (03PS5) 10Ssingh: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [19:34:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1229.eqiad.wmnet with reason: Maintenance [19:34:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1229.eqiad.wmnet with reason: Maintenance [19:34:34] (03PS6) 10Ssingh: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [19:34:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T370903)', diff saved to https://phabricator.wikimedia.org/P68211 and previous config saved to /var/cache/conftool/dbconfig/20240829-193436-ladsgroup.json [19:34:41] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:35:59] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [19:36:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T370903)', diff saved to https://phabricator.wikimedia.org/P68212 and previous config saved to /var/cache/conftool/dbconfig/20240829-193647-ladsgroup.json [19:38:45] (03PS1) 10Andrea Denisse: alert: Enable the alert[12]002 hosts as alertmanagers 
[puppet] - 10https://gerrit.wikimedia.org/r/1064806 (https://phabricator.wikimedia.org/T372418) [19:40:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P68213 and previous config saved to /var/cache/conftool/dbconfig/20240829-194040-ladsgroup.json [19:48:43] jouncebot: nowandnext [19:48:43] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [19:48:43] In 0 hour(s) and 11 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T2000) [19:51:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P68214 and previous config saved to /var/cache/conftool/dbconfig/20240829-195154-ladsgroup.json [19:55:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T371742)', diff saved to https://phabricator.wikimedia.org/P68215 and previous config saved to /var/cache/conftool/dbconfig/20240829-195547-ladsgroup.json [19:55:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:55:52] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:56:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:56:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T371742)', diff saved to https://phabricator.wikimedia.org/P68216 and previous config saved to /var/cache/conftool/dbconfig/20240829-195609-ladsgroup.json [19:59:01] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10104246 (10wiki_willy) Hi @dcaro - just following up on this to 
see if you were ok with shipping these WMCS drives... [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240829T2000). [20:00:04] srishakatux, chlod, katherine_g, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] here [20:00:16] o/ here [20:00:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:02:20] here! [20:04:00] As much as I'd like to, I can deploy this evening! Hopefully another deployer appears [20:04:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10104251 (10cmooney) >>! In T370475#10104012, @Jhancock.wm wrote: > @cmooney I got everything swapped around as requested and the new 1G link i... [20:04:18] TheresNoTime: your irc nick seems quite appropriate tonight :) [20:04:23] ;P [20:04:43] (and i'm assuming you meant "i can't deploy") [20:04:57] Oh gosh, yes — "I can't deploy" [20:07:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P68217 and previous config saved to /var/cache/conftool/dbconfig/20240829-200701-ladsgroup.json [20:07:25] Good afternoon, srishakatux and the rest of the Americas :) [20:08:09] I haven't done backport deployments in a while. I'm in the right place, am I? [20:08:38] You are! 
Just waiting on another deployer as I'm busy this evening [20:08:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [20:08:59] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... [20:08:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [20:15:57] TheresNoTime, still waiting?.. :) [20:16:32] Yes, let's ping the others again — RoanKattouw, urbanecm, cjming, kindrobot [20:16:42] what's up? [20:16:53] i guess i should deploy something [20:17:13] wow [20:17:14] so many patches [20:17:16] so little time [20:17:35] chlod: are you around too? [20:17:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2002.codfw.wmnet [20:17:41] yup, am around [20:17:42] good luck! [20:17:44] cool! [20:17:52] srishakatux: around for deployment? 
:) [20:17:54] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:18:22] (03CR) 10Urbanecm: [C:03+2] kawikisource: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064356 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:18:30] (03PS3) 10Chlod Alejandro: kaawiktionary: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868) [20:18:32] (03CR) 10Urbanecm: [C:03+2] kaawiktionary: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:18:40] (03PS3) 10Chlod Alejandro: iglwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868) [20:18:41] (03CR) 10Urbanecm: [C:03+2] iglwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:18:50] urandom yes! aharoni my reviewer is around too [20:18:55] cool [20:19:05] (03Merged) 10jenkins-bot: kawikisource: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064356 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:19:18] (03Merged) 10jenkins-bot: kaawiktionary: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:19:35] (03Merged) 10jenkins-bot: iglwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:19:55] srishakatux: aharoni: unfortunately, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1060895 cannot be deployed right now. the patch to master needs to be merged, and there need to be cherry-picks to the deployment-branches. if you need this backported, can you look at this? 
[20:20:34] plus, note changes of translations usually take forever to deploy. i can do that today, but at the end. if you can wait, i suggest just +2ing and merging, but ultimately up2you. [20:20:34] urbanecm it depends on the other configuration patch [20:20:57] But actually, it probably doesn't have to be. [20:20:59] aharoni: can you clarify why that is? [20:21:00] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2002 - cmooney@cumin1002" [20:21:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2002 - cmooney@cumin1002" [20:21:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:21:33] I've just given it +2. [20:21:39] (03PS3) 10Chlod Alejandro: mywikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868) [20:21:41] (03CR) 10Urbanecm: [C:03+2] mywikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:21:57] So in about 20 minutes, it should be fully merged. 
[20:22:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T370903)', diff saved to https://phabricator.wikimedia.org/P68218 and previous config saved to /var/cache/conftool/dbconfig/20240829-202209-ladsgroup.json [20:22:10] aharoni: i'd still like to understand why the dependency was originally added there :) [20:22:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1233.eqiad.wmnet with reason: Maintenance [20:22:14] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:22:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1233.eqiad.wmnet with reason: Maintenance [20:22:26] (03Merged) 10jenkins-bot: mywikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:22:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T370903)', diff saved to https://phabricator.wikimedia.org/P68219 and previous config saved to /var/cache/conftool/dbconfig/20240829-202231-ladsgroup.json [20:22:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T371742)', diff saved to https://phabricator.wikimedia.org/P68220 and previous config saved to /var/cache/conftool/dbconfig/20240829-202238-ladsgroup.json [20:22:43] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:22:46] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1064356|kawikisource: re-add custom logos (T368868)]], [[gerrit:1064363|kaawiktionary: re-add custom logos (T368868)]], [[gerrit:1063916|iglwiki: add custom logos (T368868)]], [[gerrit:1063918|mywikisource: add custom logos (T368868)]] [20:22:50] urbanecm Because they are related to the same Phab task, to fix 
namespace names for the Mongolian Wikipedia. [20:22:50] T368868: Set logos for new wikis - https://phabricator.wikimedia.org/T368868 [20:23:07] aharoni: so just to link them together in some way, but no technical reason? [20:23:10] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [20:23:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest... [20:23:22] Yes. They do not technically depend on each other. [20:23:34] makes sense, thanks for the clarification. [20:24:15] aharoni: can you also upload the wmf.x cherrypicks? or srishakatux [20:24:24] (can be done before it merges to master) [20:24:33] Let me see.. [20:24:52] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1656 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:25:08] urbanecm - cherry-pick to wmf/1.42.0-wmf.26 ? [20:25:44] Oh no, probably wmf/1.42.0-wmf.20 [20:25:53] neither [20:25:58] wmf/1.43.0-wmf.20? [20:25:59] https://versions.toolforge.org/ tells you which branches we currently use [20:26:14] aharoni: yep :) [20:26:30] Thanks. I haven't done this in many months. 
[20:26:58] !log urbanecm@deploy1003 urbanecm, chlod: Backport for [[gerrit:1064356|kawikisource: re-add custom logos (T368868)]], [[gerrit:1064363|kaawiktionary: re-add custom logos (T368868)]], [[gerrit:1063916|iglwiki: add custom logos (T368868)]], [[gerrit:1063918|mywikisource: add custom logos (T368868)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:07] chlod: please test your patches (first 4) [20:27:08] (03PS1) 10Amire80: Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) [20:27:11] testing now [20:27:21] aharoni: no worries, i am happy to help :). [20:27:56] (03CR) 10Urbanecm: [C:03+2] kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:28:03] (03CR) 10Urbanecm: kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:28:06] logo changes, an excuse to try out https://logos-purge.toolforge.org perhaps.. 
[20:28:08] (03PS3) 10Chlod Alejandro: kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868) [20:28:11] (03CR) 10Urbanecm: [C:03+2] kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:28:19] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:28:20] (03PS3) 10Chlod Alejandro: bewwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868) [20:28:24] (03CR) 10Urbanecm: [C:03+2] bewwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:28:46] urbanecm: first 4 patches look good! :D [20:28:52] chlod: awesome! proceeding [20:28:53] !log urbanecm@deploy1003 urbanecm, chlod: Continuing with sync [20:28:54] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:29:01] (03PS6) 10Kgraessle: Enable AutoModerator on id.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) [20:29:05] (03Merged) 10jenkins-bot: kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:29:14] (03Merged) 10jenkins-bot: bewwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:29:24] TheresNoTime: didn't know that tool! but theoretically, this should work without purges, as the URL is new [20:29:44] (sorry! :D ) [20:29:48] Aw :P [20:29:59] maybe in a future logo update :3 [20:30:35] katherine_g: hello! i'll take your patch soon [20:30:41] hello, sounds good! 
[20:30:48] (03CR) 10Urbanecm: [C:03+2] Enable AutoModerator on id.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [20:30:52] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 142600 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:31:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T370903)', diff saved to https://phabricator.wikimedia.org/P68221 and previous config saved to /var/cache/conftool/dbconfig/20240829-203120-ladsgroup.json [20:31:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:32:15] (03Merged) 10jenkins-bot: Enable AutoModerator on id.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065265 (https://phabricator.wikimedia.org/T365792) (owner: 10Kgraessle) [20:33:34] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064356|kawikisource: re-add custom logos (T368868)]], [[gerrit:1064363|kaawiktionary: re-add custom logos (T368868)]], [[gerrit:1063916|iglwiki: add custom logos (T368868)]], [[gerrit:1063918|mywikisource: add custom logos (T368868)]] (duration: 10m 48s) [20:33:38] T368868: Set logos for new wikis - https://phabricator.wikimedia.org/T368868 [20:33:52] urbanecm - is the cherry-pick done correctly? 
[20:34:04] aharoni: appears to be at first sight [20:34:23] yep, no issues [20:34:46] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1063919|kuswiki: add custom logos (T368868)]], [[gerrit:1063920|bewwiki: add custom logos (T368868)]], [[gerrit:1065265|Enable AutoModerator on id.wiki (T365792)]] [20:34:51] T365792: Enable AutoModerator on id.wiki - https://phabricator.wikimedia.org/T365792 [20:36:55] !log urbanecm@deploy1003 kgraessle, urbanecm, chlod: Backport for [[gerrit:1063919|kuswiki: add custom logos (T368868)]], [[gerrit:1063920|bewwiki: add custom logos (T368868)]], [[gerrit:1065265|Enable AutoModerator on id.wiki (T365792)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:20] thanks, looks good to sync [20:37:35] also all good here :) [20:37:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P68222 and previous config saved to /var/cache/conftool/dbconfig/20240829-203745-ladsgroup.json [20:38:00] sounds cool! [20:38:01] !log urbanecm@deploy1003 kgraessle, urbanecm, chlod: Continuing with sync [20:38:18] (03PS9) 10Srishakatux: Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [20:38:27] (03PS2) 10C. Scott Ananian: Turn on Parsoid Read Views for eo/sv/fi wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) [20:38:29] (03CR) 10Urbanecm: [C:03+2] Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [20:38:33] (03CR) 10Urbanecm: [C:03+2] Turn on Parsoid Read Views for eo/sv/fi wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) (owner: 10C.
Scott Ananian) [20:39:21] (03Merged) 10jenkins-bot: Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [20:39:23] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for eo/sv/fi wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) (owner: 10C. Scott Ananian) [20:40:16] o [20:40:22] i'm here :) [20:40:47] hey! [20:40:55] just need couple of minutes for a sync to finish :)) [20:42:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104390 (10Jclark-ctr) [20:42:11] (03PS1) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) [20:42:36] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063919|kuswiki: add custom logos (T368868)]], [[gerrit:1063920|bewwiki: add custom logos (T368868)]], [[gerrit:1065265|Enable AutoModerator on id.wiki (T365792)]] (duration: 07m 50s) [20:42:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068821 (https://phabricator.wikimedia.org/T372810) (owner: 10C. 
Scott Ananian) [20:42:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [20:42:43] T368868: Set logos for new wikis - https://phabricator.wikimedia.org/T368868 [20:42:43] T365792: Enable AutoModerator on id.wiki - https://phabricator.wikimedia.org/T365792 [20:42:49] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1068821|Turn on Parsoid Read Views for eo/sv/fi wikivoyage (T372810)]], [[gerrit:1060893|Add project talk aliases for mnwiki (T366271)]] [20:42:56] T372810: Deploy Parsoid Read Views to eo/sv/fi wikivoyages - https://phabricator.wikimedia.org/T372810 [20:42:57] T366271: Change Wikipedia: and Wikipedia_talk: namespaces for Mongolian (for Mongolian Wikipedia) - https://phabricator.wikimedia.org/T366271 [20:43:13] katherine_g: chlod: deployed! :) [20:43:28] thanks, urbanecm! :D [20:43:32] thanks! looks good on my end [20:43:50] good [20:44:44] also all good here :) [20:44:47] c [20:44:49] !log urbanecm@deploy1003 urbanecm, srishakatux, cscott: Backport for [[gerrit:1068821|Turn on Parsoid Read Views for eo/sv/fi wikivoyage (T372810)]], [[gerrit:1060893|Add project talk aliases for mnwiki (T366271)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:44:51] great! [20:44:59] (03PS2) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) [20:45:06] srishakatux: cscott: can you test yours? :) [20:45:10] aharoni: ^ [20:45:13] ok will do! 
[20:45:26] looking [20:45:28] (03CR) 10Urbanecm: [C:03+2] Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [20:46:03] I need to enable that thing in the browser extension, right? What do I have to select there? [20:46:26] mwdebug* something? or k8s? [20:46:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P68223 and previous config saved to /var/cache/conftool/dbconfig/20240829-204628-ladsgroup.json [20:46:52] urbanecm: looks good, thanks! [20:47:00] ack, thanks [20:47:13] aharoni: AFAIK either of the codfw or eqiad ones will work [20:47:19] aharoni: doesn't matter, but leave it at k8s [20:47:34] i always use eqiad, am i evil? [20:47:49] cscott: nope. you just are here for a while :D [20:47:54] Wikipedia tested, looks good. [20:47:57] Testing Wiktionary... [20:48:00] it's geographically closest to me i think [20:48:45] urbanecm, srishakatux OK, so I see the new namespace names in Special:Search on both , which is good. [20:49:08] (03PS1) 10Cathal Mooney: Change preseed matching pattern for sretest2xxx to include 200[1-2 [puppet] - 10https://gerrit.wikimedia.org/r/1068866 [20:49:09] However, to make them perfectly correct, the core patch needs to be deployed, too, because it includes grammar rules for displaying. [20:49:26] displaying the namespace names with the correct genitive case ending. [20:49:28] yep yep :) [20:49:33] soon [20:49:35] in CI :) [20:49:39] But we are on the right track. Thank you for the assistance. 
[20:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:49:45] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2002.codfw.wmnet with OS bookworm [20:49:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104426 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002... [20:51:12] !log urbanecm@deploy1003 urbanecm, srishakatux, cscott: Continuing with sync [20:52:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P68224 and previous config saved to /var/cache/conftool/dbconfig/20240829-205252-ladsgroup.json [20:55:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [20:55:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[20:55:58] (03CR) 10RobH: [C:03+2] Change preseed matching pattern for sretest2xxx to include 200[1-2 [puppet] - 10https://gerrit.wikimedia.org/r/1068866 (owner: 10Cathal Mooney) [20:56:05] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068821|Turn on Parsoid Read Views for eo/sv/fi wikivoyage (T372810)]], [[gerrit:1060893|Add project talk aliases for mnwiki (T366271)]] (duration: 13m 16s) [20:56:10] T372810: Deploy Parsoid Read Views to eo/sv/fi wikivoyages - https://phabricator.wikimedia.org/T372810 [20:56:10] T366271: Change Wikipedia: and Wikipedia_talk: namespaces for Mongolian (for Mongolian Wikipedia) - https://phabricator.wikimedia.org/T366271 [20:58:32] aharoni: https://integration.wikimedia.org/ci/job/mwgate-node18/57099/console ... [20:58:35] ci failure [20:59:53] Fixing... let's see if we have time... [20:59:58] Such a silly thing [21:00:00] window ends in a minute :) [21:00:05] ci takes 20 mins at least [21:00:06] :( [21:00:14] so, i'd reschedule [21:00:15] Well, I'll fix anyway [21:00:18] OK [21:00:33] (03CR) 10Urbanecm: Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [21:00:51] sorry :) [21:01:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P68225 and previous config saved to /var/cache/conftool/dbconfig/20240829-210135-ladsgroup.json [21:02:05] (03PS6) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [21:02:05] (03PS5) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [21:03:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [21:03:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 
13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [21:03:48] (03CR) 10CI reject: [V:04-1] Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [21:04:05] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [21:04:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104450 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest... [21:05:33] cscott, i'm still around if anything needs testing. [21:05:51] oh, nvm, I just saw that it is already live. [21:06:16] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [21:06:31] subbu: yeah looks like a smooth deploy. If you remember the page on eowikivoyage which had the bogus image option issue, you could double check that it renders correctly now. [21:06:42] I was just updating our known issues list and removed that as fixed [21:07:10] ok .. i had already confirmed those fixed yday after group1 deploy and visual diff tests cleared those. 
[21:07:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [21:08:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T371742)', diff saved to https://phabricator.wikimedia.org/P68226 and previous config saved to /var/cache/conftool/dbconfig/20240829-210759-ladsgroup.json [21:08:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [21:08:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:08:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [21:08:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T371742)', diff saved to https://phabricator.wikimedia.org/P68227 and previous config saved to /var/cache/conftool/dbconfig/20240829-210822-ladsgroup.json [21:10:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [21:10:30] subbu: cool! 
[21:11:48] (03Abandoned) 10Amire80: Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [21:12:56] (03Restored) 10Amire80: Modify namespace translation for mnwiki [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [21:13:00] (03PS7) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [21:13:00] (03PS6) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [21:13:40] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:45] urbanecm srishakatux I fixed the original patch in the master branch, and it looks OK now. How do I update the cherry-pick now? [21:16:10] (03PS2) 10Amire80: Modify namespace translation for Mongolian (mn) [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) [21:16:22] Oh, I think I've managed to do it. 
[21:16:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T370903)', diff saved to https://phabricator.wikimedia.org/P68228 and previous config saved to /var/cache/conftool/dbconfig/20240829-211642-ladsgroup.json [21:16:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [21:16:47] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:16:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [21:17:05] (03CR) 10CI reject: [V:04-1] [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 (owner: 10Ladsgroup) [21:17:15] (03CR) 10CI reject: [V:04-1] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [21:19:55] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2002.codfw.wmnet with OS bookworm [21:20:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002... [21:22:21] (03CR) 10Jforrester: [C:03+1] "Thanks for remembering; I forgot!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068120 (owner: 10Ladsgroup) [21:23:48] (03PS8) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [21:23:48] (03PS7) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 [21:23:57] (03PS1) 10Scott French: k8s-controller-sidecars: adopt securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) [21:24:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [21:26:53] aharoni: just cherry-pick again [21:27:01] but i see you found this/other way :) [21:27:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1246.eqiad.wmnet with reason: Maintenance [21:27:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1246.eqiad.wmnet with reason: Maintenance [21:27:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T370903)', diff saved to https://phabricator.wikimedia.org/P68229 and previous config saved to /var/cache/conftool/dbconfig/20240829-212727-ladsgroup.json [21:27:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:27:40] (03CR) 10CI reject: [V:04-1] [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 (owner: 10Ladsgroup) [21:27:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: 10Amire80) [21:27:47] (aharoni: "just 
cherrypick again" only applies if you use the button in gerrit web UI. command-line way wouldn't do that.) [21:30:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T371742)', diff saved to https://phabricator.wikimedia.org/P68230 and previous config saved to /var/cache/conftool/dbconfig/20240829-213015-ladsgroup.json [21:30:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:32:25] (03CR) 10Scott French: "Context: When mw2297 was cordoned earlier today, k8s-controller-sidecars was rescheduled, and is now failing to back up with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [21:35:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T370903)', diff saved to https://phabricator.wikimedia.org/P68231 and previous config saved to /var/cache/conftool/dbconfig/20240829-213526-ladsgroup.json [21:35:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:41:54] (03PS2) 10Jforrester: Use more use statements rather than inline FQN [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068107 (owner: 10Reedy) [21:42:01] (03PS3) 10Jforrester: Use more use statements rather than inline FQN [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068107 (owner: 10Reedy) [21:44:52] (03CR) 10Jforrester: [C:03+1] Use more use statements rather than inline FQN [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068107 (owner: 10Reedy) [21:45:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [21:45:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
dse-k8s-worker1009.eqiad.wmnet with OS bookworm [21:45:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [21:45:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P68232 and previous config saved to /var/cache/conftool/dbconfig/20240829-214523-ladsgroup.json [21:49:09] (03PS2) 10Andrew Bogott: keystone::apache: include auth_openidc [puppet] - 10https://gerrit.wikimedia.org/r/1068260 (https://phabricator.wikimedia.org/T359590) [21:49:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068260 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [21:50:28] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [21:50:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P68233 and previous config saved to /var/cache/conftool/dbconfig/20240829-215034-ladsgroup.json [21:50:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest... 
[21:53:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [21:53:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [21:54:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [21:54:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104653 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [22:00:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P68234 and previous config saved to /var/cache/conftool/dbconfig/20240829-220030-ladsgroup.json [22:02:13] (03CR) 10Andrew Bogott: [C:03+2] keystone::apache: include auth_openidc [puppet] - 10https://gerrit.wikimedia.org/r/1068260 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [22:05:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P68235 and previous config saved to /var/cache/conftool/dbconfig/20240829-220541-ladsgroup.json [22:10:23] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php testwiki # T183490 [22:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:31] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:15:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2178 (T371742)', diff saved to https://phabricator.wikimedia.org/P68236 and previous config saved to /var/cache/conftool/dbconfig/20240829-221537-ladsgroup.json [22:15:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [22:15:42] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:15:47] (03PS1) 10Andrew Bogott: keystone::apache: install mod_auth_openidc package [puppet] - 10https://gerrit.wikimedia.org/r/1068876 (https://phabricator.wikimedia.org/T359590) [22:15:49] (03PS1) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 [22:15:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [22:16:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T371742)', diff saved to https://phabricator.wikimedia.org/P68237 and previous config saved to /var/cache/conftool/dbconfig/20240829-221559-ladsgroup.json [22:16:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068876 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [22:16:19] (03PS2) 10Andrew Bogott: keystone::apache: install mod_auth_openidc package [puppet] - 10https://gerrit.wikimedia.org/r/1068876 (https://phabricator.wikimedia.org/T359590) [22:17:34] (03CR) 10RLazarus: [C:03+1] "Thanks for this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [22:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:19:28] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bookworm [22:19:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10104673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002... [22:20:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T370903)', diff saved to https://phabricator.wikimedia.org/P68238 and previous config saved to /var/cache/conftool/dbconfig/20240829-222048-ladsgroup.json [22:20:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:20:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:21:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:21:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068876 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [22:25:51] (03CR) 10Andrew Bogott: [C:03+2] keystone::apache: install mod_auth_openidc package [puppet] - 10https://gerrit.wikimedia.org/r/1068876 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [22:28:04] !log ladsgroup@cumin1002 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [22:28:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [22:28:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T370903)', diff saved to https://phabricator.wikimedia.org/P68239 and previous config saved to /var/cache/conftool/dbconfig/20240829-222824-ladsgroup.json [22:28:29] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:29:14] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10104682 (10Dwisehaupt) 05Open→03Resolved a:03Dwisehaupt This looks like it's all complete and we are working well with the new hosts... [22:36:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T371742)', diff saved to https://phabricator.wikimedia.org/P68240 and previous config saved to /var/cache/conftool/dbconfig/20240829-223602-ladsgroup.json [22:36:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:39:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T370903)', diff saved to https://phabricator.wikimedia.org/P68241 and previous config saved to /var/cache/conftool/dbconfig/20240829-223949-ladsgroup.json [22:39:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:44:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [22:45:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), 
ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [22:45:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [22:45:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [22:47:01] (03PS1) 10Zabe: Do not treat failed autocreations on closed wikis as errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068879 [22:47:12] (03PS2) 10Zabe: Do not treat log autocreations on closed wikis as diagnostic errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068879 [22:47:24] (03PS3) 10Zabe: Do not log failed autocreations on closed wikis as diagnostic errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068879 [22:49:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:11] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P68242 and previous config saved to /var/cache/conftool/dbconfig/20240829-225109-ladsgroup.json [22:54:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P68243 and previous config saved to /var/cache/conftool/dbconfig/20240829-225456-ladsgroup.json [22:55:47] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:03:39] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:09] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:09] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10104772 (10VRiley-WMF) Ports: 27/28 patch to cr2-eqiad:xe-3/0/3 - Cable ID 1-8292024 Ports 25/26 patch to cr1-eqiad:xe-... [23:04:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10104777 (10VRiley-WMF) [23:04:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10104778 (10VRiley-WMF) a:03cmooney [23:06:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P68244 and previous config saved to /var/cache/conftool/dbconfig/20240829-230616-ladsgroup.json [23:10:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P68245 and previous config saved to /var/cache/conftool/dbconfig/20240829-231003-ladsgroup.json [23:21:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2192 (T371742)', diff saved to https://phabricator.wikimedia.org/P68246 and previous config saved to /var/cache/conftool/dbconfig/20240829-232124-ladsgroup.json [23:21:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance [23:21:29] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:21:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance [23:25:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T370903)', diff saved to https://phabricator.wikimedia.org/P68247 and previous config saved to /var/cache/conftool/dbconfig/20240829-232510-ladsgroup.json [23:25:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [23:25:15] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:25:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [23:25:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:25:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:25:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T370903)', diff saved to https://phabricator.wikimedia.org/P68248 and previous config saved to /var/cache/conftool/dbconfig/20240829-232548-ladsgroup.json [23:28:07] PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:28:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2126 (T370903)', diff saved to https://phabricator.wikimedia.org/P68249 and previous config saved to /var/cache/conftool/dbconfig/20240829-232810-ladsgroup.json
[23:33:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm
[23:33:31] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10104867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[23:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1068886
[23:39:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1068886 (owner: TrainBranchBot)
[23:43:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P68250 and previous config saved to /var/cache/conftool/dbconfig/20240829-234317-ladsgroup.json
[23:44:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[23:44:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[23:44:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T371742)', diff saved to https://phabricator.wikimedia.org/P68251 and previous config saved to /var/cache/conftool/dbconfig/20240829-234420-ladsgroup.json
[23:44:25] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:52:19] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox, Patch-For-Review: Support large files in Shellbox -
https://phabricator.wikimedia.org/T292322#10104888 (tstarling) For configuration propagation, it turns out to be better to have the Command act as a factory, so we will have: `php $command...
[23:58:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P68252 and previous config saved to /var/cache/conftool/dbconfig/20240829-235824-ladsgroup.json