[00:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1089903 [00:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1089903 (owner: 10TrainBranchBot) [00:44:25] (03CR) 10Jforrester: [C:03+1] "Just for safety, I'm going to merge this tomorrow (after the train cut) rather than 2 hours before it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [00:45:04] (03CR) 10Jforrester: [C:03+1] "Bah, never mind me, obviously this is the prod version and not test one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [01:00:58] (03CR) 10Ssingh: "What's the current systemd-analyze score? Looks good but curious to see." [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [01:08:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1089910 [01:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1089910 (owner: 10TrainBranchBot) [01:14:56] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1089903 (owner: 10TrainBranchBot) [01:44:58] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1089910 (owner: 10TrainBranchBot) [02:06:20] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [02:06:20] PROBLEM - 
grafana-next-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [02:08:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1089924 (https://phabricator.wikimedia.org/T375662) [02:08:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1089924 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [02:16:20] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [02:16:20] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [02:19:03] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1089924 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [02:26:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:28] 06SRE, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Restore commons-l subscribers removed due to fat finger "remove all members" - https://phabricator.wikimedia.org/T379519#10310979 (10Ladsgroup) Thank you @jcrespo 😍 [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:17] (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089939 [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0300) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:16:20] (03PS2) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1089939 [03:26:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0400) [04:20:30] (03PS1) 10KartikMistry: Update recommendation api to 2024-11-11-200548-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379037) [04:28:25] RESOLVED: SystemdUnitFailed: 
update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:00] (03PS1) 10TChin: EventStreamConfig: Enable Hive Ingestion for most streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0500) [05:01:54] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.28 (duration: 01m 52s) [05:08:06] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:46] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:30:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on 
mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:51:28] (03PS2) 10Andrea Denisse: grafana: Fix login redirection to preserve dashboard context [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) [05:51:28] (03CR) 10Andrea Denisse: "I tested this change in the grafana-next.wikimedia.org host by following these steps:" [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [05:55:32] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4491/co" [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [05:56:35] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1088611/4491/" [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [06:33:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0700) [07:00:05] marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0700). [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:12:12] (03CR) 10DCausse: [C:04-1] "should this be marked as `Depends-On: Iff5d89bea1034bc3386f96dff2863fa1f38fa04a`? 
Wikis are still running `1.44.0-wmf.2` which does have t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [07:18:01] (03PS5) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [07:29:50] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:30:13] (03CR) 10Muehlenhoff: [C:03+2] Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1088522 (owner: 10Muehlenhoff) [07:31:39] (03PS1) 10Brouberol: airflow: fix configuration keys containing the analytics-hadoop cluster name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090398 (https://phabricator.wikimedia.org/T377602) [07:31:40] (03PS1) 10Brouberol: airflow: add missing python dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) [07:31:42] (03PS1) 10Brouberol: airflow: enable yarn log aggregation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) [07:33:12] (03PS2) 10Brouberol: airflow: add missing python dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) [07:33:13] (03PS2) 10Brouberol: airflow: enable yarn log aggregation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) [07:35:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:07] (03PS2) 10Brouberol: airflow: fix configuration keys containing the analytics-hadoop cluster name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090398 (https://phabricator.wikimedia.org/T377602) [07:41:08] (03PS3) 10Brouberol: airflow: add missing python dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) [07:41:08] (03PS3) 10Brouberol: airflow: enable yarn log aggregation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) [07:52:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CirrusSearch] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089230 (https://phabricator.wikimedia.org/T378664) (owner: 10Urbanecm) [07:52:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:52:25] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [07:52:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:53:47] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti-test2003 [07:53:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti-test2003 [07:55:15] (03CR) 10Fabfur: "→ Overall exposure level for haproxykafka.service: 4.4 OK 🙂" [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [07:56:22] (03CR) 10Fabfur: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [07:56:24] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1089605 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T0800). [08:00:05] pfischer and urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:09] i can deploy today [08:00:11] pfischer: morning! [08:00:39] PROBLEM - Host ganeti-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:01:02] (03CR) 10Urbanecm: [C:03+2] Fix WeightedTagsUpdater [extensions/CirrusSearch] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089230 (https://phabricator.wikimedia.org/T378664) (owner: 10Urbanecm) [08:02:04] ^ ganeti-test2003 is expected, WIP [08:02:11] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:02:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:02:41] (03CR) 10DCausse: [C:03+1] "nevermind, just saw that this patch is about to be deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:02:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:02:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:02:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: 
Maintenance [08:03:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71001 and previous config saved to /var/cache/conftool/dbconfig/20241112-080303-arnaudb.json [08:03:07] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:03:21] (03CR) 10Urbanecm: "Correct, the plan is to backport the fix to CirrusSearch first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:04:57] !log installing apache security updates [08:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:05:47] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:06:11] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:34] pfischer: around? 
:) [08:09:49] (03PS1) 10Fabfur: Revert "hiera: enable haproxykafka on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1090415 [08:17:37] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 219.48 ms [08:17:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [08:17:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311226 (10ops-monitoring-bot) Draining ganeti1009.eqiad.wmnet of running VMs [08:18:34] (03CR) 10Fabfur: [C:03+2] Revert "hiera: enable haproxykafka on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1090415 (owner: 10Fabfur) [08:19:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [08:19:55] (03Merged) 10jenkins-bot: Fix WeightedTagsUpdater [extensions/CirrusSearch] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1089230 (https://phabricator.wikimedia.org/T378664) (owner: 10Urbanecm) [08:21:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1089230|Fix WeightedTagsUpdater (T378664 T378983)]] [08:21:04] T378664: [wmf.1] refreshLinkRecommendations.php - Unable to deliver all events: 400: Bad Request - https://phabricator.wikimedia.org/T378664 [08:21:04] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:24:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311235 (10MoritzMuehlenhoff) [08:24:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:25:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet [08:26:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311237 (10ops-monitoring-bot) Draining ganeti1015.eqiad.wmnet of running VMs [08:27:33] (03PS1) 10Slyngshede: Account blocking: blocking should not fail if account is not blocked [software/bitu] - 10https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) [08:28:00] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1089230|Fix WeightedTagsUpdater (T378664 T378983)]] (duration: 06m 59s) [08:28:06] T378664: [wmf.1] refreshLinkRecommendations.php - Unable to deliver all events: 400: Bad Request - https://phabricator.wikimedia.org/T378664 [08:28:06] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:28:11] waiting for pfischer for the config change [08:28:55] FIRING: MaxConntrack: Max conntrack at 98.57% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [08:29:00] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Pwaigi - https://phabricator.wikimedia.org/T379225#10311244 (10PWaigi-WMF) @MatthewVernon, I have access now; I am closing this ticket. Thanks. 
[08:29:07] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Pwaigi - https://phabricator.wikimedia.org/T379225#10311245 (10PWaigi-WMF) 05Open→03Resolved [08:31:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet [08:32:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet [08:32:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311249 (10ops-monitoring-bot) Draining ganeti1015.eqiad.wmnet of running VMs [08:33:55] RESOLVED: MaxConntrack: Max conntrack at 98.57% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [08:34:36] urbanecm: I am around now, sorry for the delay. [08:34:54] pfischer: great! 
let's try to re-enable then [08:34:58] (03PS2) 10Peter Fischer: CirrusSearch: re-enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) [08:35:01] (03CR) 10Urbanecm: [C:03+2] CirrusSearch: re-enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:35:43] (03Merged) 10jenkins-bot: CirrusSearch: re-enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:36:21] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp5017 for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1090424 (https://phabricator.wikimedia.org/T378578) [08:36:29] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1089826|CirrusSearch: re-enable offloading weighted tags via EventBus (T378983)]] [08:36:34] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:37:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090424 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:38:36] !log urbanecm@deploy2002 pfischer, urbanecm: Backport for [[gerrit:1089826|CirrusSearch: re-enable offloading weighted tags via EventBus (T378983)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:38:42] okay, we're at mwdebug now [08:38:54] pfischer: if i issue an event from mwdebug, are you able to verify it comes through? 
[08:39:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:25] Yes, I’ll tap into the kafka topic, one sec. [08:39:39] waiting [08:39:51] RECOVERY - Host ganeti-test2003 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [08:40:14] urbanecm: ready [08:40:27] ok, triggering [08:40:48] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:42:07] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on cp5017 for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1090424 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:43:33] pfischer: there should be an addition now [08:43:46] (testwiki, Apollo_11, id 160445) [08:43:50] urbanecm: yes, confirmed [08:44:08] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T379233#10311282 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This was a lingering issue caused by an interface name change... [08:44:19] great! progress :). let me try to trigger a removal now [08:44:58] pfischer: we should now have a removal [08:45:16] (testwiki, Nobelium, 119493) [08:45:52] urbanecm: ✅ [08:46:10] that sounds like it works now! i don't think there are any other operations besides those two, are there? [08:46:40] urbanecm: No, I’ll just check if our update pipeline ingested those two as expected. [08:46:53] okay, thanks. waiting. 
[08:47:15] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on cp5017 for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1090424 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:49:20] i do not see them processed when running queries via `Special:Search`, but that might very well be OK [08:49:38] (03CR) 10Brouberol: airflow: add airflow-wmde files (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [08:53:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090425 (https://phabricator.wikimedia.org/T375662) [08:53:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090425 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [08:53:50] pfischer: from my end, Nobelium disappeared (expected), still no Apollo 11 (not expected) [08:53:53] urbanecm: I can’t find them in the inter-update-pipeline-topic either, but the app keeps running, so obviously no crashes are caused and we do get the kafka records now [08:54:18] so even if we have to fix sth. in our app, you should be fine [08:54:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:55:52] pfischer: hmm, i'm reluctant to proceed, given what happened last time. do you think we should go ahead? or maybe it would make sense to keep it as true at testwiki, until we're certain it works end to end? 
[08:56:09] (03Abandoned) 10Brouberol: airflow: release airflow 2.10.3 on our test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088575 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [08:56:15] (03PS8) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [08:56:15] (03PS1) 10Elukey: Move kartotherian-k8s-ssl LVS endpoint to "production" state [puppet] - 10https://gerrit.wikimedia.org/r/1090426 (https://phabricator.wikimedia.org/T378944) [08:56:40] urbanecm: found the events in the intermediate topic now, didn’t go back far enough. [08:56:50] ah [08:57:05] urbanecm: …and forgot about the 10min delay [08:57:19] it's 10 mins now since the addition though? [08:57:57] according to IRC, the event was emitted at :43, now it's :57, so by now, it should be ingested? [08:58:07] Yes, and they were marked as rev_based [08:58:29] i'm not sure what that means [08:59:04] Ah, rev_based means, we try to merge those events with other revision-related events in a 10min window [08:59:09] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4492/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090426 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [08:59:14] gotcha [08:59:55] urbanecm: can you query the results now via Special:Search? 
[09:00:23] pfischer: yes, removal was processed correctly, but the addition is not visible [09:00:36] which is what is confusing me, because as far as i understand this, both SHOULD be visible by now [09:01:24] `pageid:160445 hasrecommendation:link` and `pageid:119493 hasrecommendation:link` are the queries i'm using [09:03:19] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090425 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [09:03:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71002 and previous config saved to /var/cache/conftool/dbconfig/20241112-090329-arnaudb.json [09:03:33] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:07:37] still no change [09:08:14] pfischer: what do you think? as of now, i'm heavily leaning towards reverting, this feels too suspicious ATM. happy to keep things at eventgate for testwiki. [09:10:07] urbanecm: hm, I tried to find sth. in the logs but couldn’t, our app runs without crashes so it’s no obvious bug. Sure, let’s revert and investigate, at least we have some reproducers now. [09:10:19] !log urbanecm@deploy2002 Sync cancelled. 
[09:10:25] reverting [09:10:34] (03PS1) 10TrainBranchBot: Revert "CirrusSearch: re-enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090428 [09:10:35] (03CR) 10TrainBranchBot: "urbanecm@deploy2002 created a revert of this change as I63fd4f53fbfd7d48d9adbaf09f587999474c37e6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089826 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [09:10:55] (03CR) 10Urbanecm: [C:03+2] Revert "CirrusSearch: re-enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090428 (owner: 10TrainBranchBot) [09:11:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090428 (owner: 10TrainBranchBot) [09:11:38] (03Merged) 10jenkins-bot: Revert "CirrusSearch: re-enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090428 (owner: 10TrainBranchBot) [09:11:53] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1090428|Revert "CirrusSearch: re-enable offloading weighted tags via EventBus"]] [09:12:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1090426 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [09:14:00] !log urbanecm@deploy2002 trainbranchbot, urbanecm: Backport for [[gerrit:1090428|Revert "CirrusSearch: re-enable offloading weighted tags via EventBus"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:14:04] !log urbanecm@deploy2002 trainbranchbot, urbanecm: Continuing with sync [09:17:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:18:23] (CR) Brouberol: airflow: add airflow-wmde files (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: Stevemunene)
[09:18:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P71004 and previous config saved to /var/cache/conftool/dbconfig/20241112-091836-arnaudb.json
[09:18:39] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090428|Revert "CirrusSearch: re-enable offloading weighted tags via EventBus"]] (duration: 06m 46s)
[09:18:46] pfischer: okay, revert is deployed
[09:22:45] pfischer: and posted https://phabricator.wikimedia.org/T377150#10311378 with a summary
[09:23:00] (CR) Brouberol: airflow: add airflow-wmde files (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: Stevemunene)
[09:23:31] pfischer: leaving for now, please ping me if i can help somehow.
[09:25:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:30:10] (03PS1) 10TChin: flink-app: Add default checkpointing config for Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) [09:31:13] (03CR) 10Elukey: [V:03+1 C:03+2] Move kartotherian-k8s-ssl LVS endpoint to "production" state [puppet] - 10https://gerrit.wikimedia.org/r/1090426 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [09:31:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600 (10MoritzMuehlenhoff) 03NEW [09:33:12] (03PS9) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [09:33:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P71005 and previous config saved to /var/cache/conftool/dbconfig/20241112-093343-arnaudb.json [09:35:27] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10311415 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:36:40] (03PS1) 10JMeybohm: preseed: Migrate wikikube-ctrl1* to containerd partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1090432 (https://phabricator.wikimedia.org/T377876) [09:38:34] (03PS1) 10JMeybohm: wikikube-staging: Remove obsolete docker hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1090433 (https://phabricator.wikimedia.org/T362408) [09:39:06] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090433 (https://phabricator.wikimedia.org/T362408) (owner: 
10JMeybohm) [09:39:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:28] (03CR) 10Brouberol: airflow: add airflow-wmde files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [09:41:39] !log update d-i netboot image for 12.8 point release T379600 [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:42] T379600: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600 [09:41:49] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10311435 (10MoritzMuehlenhoff) [09:45:54] (03PS6) 10Muehlenhoff: Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - 10https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) [09:46:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [09:48:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71006 and previous config saved to 
/var/cache/conftool/dbconfig/20241112-094851-arnaudb.json [09:48:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:51:08] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet code for legacy udpmixecho/ircecho setup [puppet] - 10https://gerrit.wikimedia.org/r/1089652 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [09:52:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090435 (https://phabricator.wikimedia.org/T375662) [09:52:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090435 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [09:53:03] (03CR) 10Btullis: [C:03+1] airflow: fix configuration keys containing the analytics-hadoop cluster name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090398 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [09:53:24] (03CR) 10Btullis: [C:03+1] "Ah yes, we added this recently." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [09:53:47] (03CR) 10Btullis: [C:03+1] "Great!. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [09:55:14] (03CR) 10Btullis: [C:03+2] Move start day of dump_fillin_wd job from the 7th to the 10th of the month [puppet] - 10https://gerrit.wikimedia.org/r/1088599 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [09:55:36] (03PS1) 10Muehlenhoff: tcpircbot: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1090437 [09:55:53] (03PS1) 10Brouberol: ceph-csi-rbd: move all dse-related values to a dedicated value files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090438 (https://phabricator.wikimedia.org/T379601) [09:55:54] (03PS1) 10Brouberol: ceph-csi-rbd: convert the nodeplugin & provisioner clusterroles to ns-scoped roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090439 (https://phabricator.wikimedia.org/T379601) [09:55:55] (03PS1) 10Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) [09:58:31] (03PS2) 10Brouberol: ceph-csi-rbd: convert the nodeplugin & provisioner clusterroles to ns-scoped roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090439 (https://phabricator.wikimedia.org/T379601) [09:58:31] (03PS2) 10Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) [09:58:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090437 (owner: 10Muehlenhoff) [09:59:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db2217 gradually with 4 steps - T379491 [09:59:59] T379491: PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL - https://phabricator.wikimedia.org/T379491 [10:00:28] arnaudb: I'm guessing that's you :) [10:00:52] 
!incidents
[10:00:52] 5392 (RESOLVED) NELHigh sre (thanos-rule tcp.address_unreachable)
[10:01:28] oh damn.. no real page.... just used the page hashtag on the phab task title.. that's eveil
[10:01:30] *evil
[10:01:48] (CR) CI reject: [V:-1] Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: Brouberol)
[10:02:00] (CR) CI reject: [V:-1] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1090435 (https://phabricator.wikimedia.org/T375662) (owner: TrainBranchBot)
[10:04:00] vgutierrez: sorry for the noise, I'll redact the #page next time 😱
[10:04:22] arnaudb: same for your messages here please :)
[10:04:27] hahaha
[10:04:29] lol
[10:04:32] *facepalm*
[10:04:33] indeed
[10:04:50] (Abandoned) Muehlenhoff: Simplify profile::cache::kafka::certificate to only support PKI/cfssl [puppet] - https://gerrit.wikimedia.org/r/1030052 (https://phabricator.wikimedia.org/T337825) (owner: Muehlenhoff)
[10:04:56] :D
[10:05:10] at least you are well aware :') sorry again
[10:05:16] (PS3) Brouberol: ceph-csi-rbd: convert the nodeplugin & provisioner clusterroles to ns-scoped roles [deployment-charts] - https://gerrit.wikimedia.org/r/1090439 (https://phabricator.wikimedia.org/T379601)
[10:05:16] (PS3) Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601)
[10:09:14] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1090432 (https://phabricator.wikimedia.org/T377876) (owner: JMeybohm)
[10:09:47] (CR) Btullis: [C:+1] ceph-csi-rbd: move all dse-related values to a dedicated value files [deployment-charts] - https://gerrit.wikimedia.org/r/1090438 (https://phabricator.wikimedia.org/T379601) (owner:
10Brouberol) [10:10:07] (03PS1) 10Arnaudb: mariadb: add db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1090445 (https://phabricator.wikimedia.org/T373579) [10:10:08] (03CR) 10Arnaudb: [C:03+2] mariadb: add db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1090445 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:12:00] 06SRE, 06Infrastructure-Foundations, 10netops: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10311548 (10cmooney) I discussed this briefly with @ayounsi on irc and while this is probably a good idea it won't, as things stand, p... [10:12:16] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db2236 slowly with 10 steps - slow repool T373579 [10:12:19] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [10:15:11] (03CR) 10Elukey: [C:03+2] profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [10:15:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:20:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:58] (03PS1) 10Michael Große: growthexperiments.pp: track dangling records for cswiki hourly [puppet] - 10https://gerrit.wikimedia.org/r/1090449 (https://phabricator.wikimedia.org/T372337) [10:27:07] (03CR) 10Brouberol: airflow: add airflow-wmde files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 
(https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [10:27:50] (03PS1) 10Elukey: docker_registry_ha: allow /v2/_catalog only for internal clients [puppet] - 10https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) [10:28:55] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4493/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [10:30:04] (03CR) 10Ayounsi: Expose IPsec tunnel configuration from Netbox (034 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [10:31:02] (03CR) 10Brouberol: [C:03+2] airflow: fix configuration keys containing the analytics-hadoop cluster name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090398 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:31:05] (03CR) 10Brouberol: [C:03+2] airflow: add missing python dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:31:08] (03CR) 10Brouberol: [C:03+2] airflow: enable yarn log aggregation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:32:07] (03Merged) 10jenkins-bot: airflow: fix configuration keys containing the analytics-hadoop cluster name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090398 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:32:26] (03Merged) 10jenkins-bot: airflow: add missing python dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090399 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:32:27] (03Merged) 10jenkins-bot: airflow: enable yarn log aggregation [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1090400 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [10:33:14] (03CR) 10Btullis: [C:03+1] ceph-csi-rbd: convert the nodeplugin & provisioner clusterroles to ns-scoped roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090439 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:33:19] (03CR) 10Btullis: [C:03+1] Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:33:33] (03CR) 10Ayounsi: "lgtm, from the end diff I'm wondering if it's fine to not set the `authentication-algorithm` ?" [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [10:33:38] (03CR) 10Btullis: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:33:55] (03CR) 10Btullis: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:34:12] (03CR) 10Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:34:56] (03PS4) 10Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) [10:35:01] (03CR) 10Brouberol: Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090440 
(https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [10:36:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:36:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:36:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:37:22] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [10:37:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:41:41] 06SRE, 06Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411#10311664 (10cmooney) > Certain NICs in our estate are not seen as 'onboard', and expose no 'acpi index'. This results in no ID_NET_NAME_ONBOARD being populated for... [10:42:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:44:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2217 gradually with 4 steps - T379491 [10:45:20] T379491: PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL - https://phabricator.wikimedia.org/T379491 [10:45:53] arnaudb: ^? 
[10:46:24] ^ arnaudb you probably want to rename that task :)
[10:46:25] Cookbook just finished running
[10:47:17] that task name is capital-E Evil
[10:48:02] I renamed that
[10:53:11] I'll rename it for the sake of probable future edits, sorry for the noise! I was in a meeting
[10:53:43] ah RhinosF1 well done :) thanks
[10:54:33] (CR) Brouberol: [C:+1] Update druid test config to drop unused segments automatically [puppet] - https://gerrit.wikimedia.org/r/1077653 (https://phabricator.wikimedia.org/T376118) (owner: Btullis)
[10:57:36] (CR) Brouberol: [C:+2] Add an explicit list of namespaces in which to grant ceph-csi-rbd permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1090440 (https://phabricator.wikimedia.org/T379601) (owner: Brouberol)
[10:57:41] (CR) Brouberol: [C:+2] ceph-csi-rbd: convert the nodeplugin & provisioner clusterroles to ns-scoped roles [deployment-charts] - https://gerrit.wikimedia.org/r/1090439 (https://phabricator.wikimedia.org/T379601) (owner: Brouberol)
[10:57:44] (CR) Brouberol: [C:+2] ceph-csi-rbd: move all dse-related values to a dedicated value files [deployment-charts] - https://gerrit.wikimedia.org/r/1090438 (https://phabricator.wikimedia.org/T379601) (owner: Brouberol)
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1100)
[11:01:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:01:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:02:00] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4494/" [puppet] - 10https://gerrit.wikimedia.org/r/1090433 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:04:19] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1090433 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:17:51] 06SRE, 06Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411#10311758 (10cmooney) Another complication I see is that on a SuperMicro device there is no //ID_NET_NAME_ONBOARD// populated, whereas on a Dell there is (though we... [11:18:01] (03PS1) 10Brouberol: ceph-csi-rbd: grant the ClusterRole permissions to list/watch secrets in all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090454 (https://phabricator.wikimedia.org/T379601) [11:19:46] (03PS1) 10Urbanecm: [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 (https://phabricator.wikimedia.org/T378983) [11:23:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[11:24:08] (03PS2) 10Brouberol: ceph-csi-rbd: grant the ClusterRole permissions to list/watch secrets in all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090454 (https://phabricator.wikimedia.org/T379601) [11:24:54] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:10] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [11:27:48] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1013.eqiad.wmnet [11:31:07] (03PS1) 10Hashar: Upgrade wikimedia/relpath from 4.0.0 to 4.0.1 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090456 (https://phabricator.wikimedia.org/T379480) [11:32:33] (03CR) 10Jaime Nuche: [C:03+2] Upgrade wikimedia/relpath from 4.0.0 to 4.0.1 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090456 (https://phabricator.wikimedia.org/T379480) (owner: 10Hashar) [11:37:11] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:40:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:42:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:42:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1013.eqiad.wmnet [11:44:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:25] FIRING: SystemdUnitFailed: 
update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:52] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1010.eqiad.wmnet [11:48:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311839 (10MoritzMuehlenhoff) [11:48:07] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [11:52:24] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:54:20] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:54:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet [12:04:29] (03PS2) 10Ilias Sarantopoulos: ml-services: update aya model deployment to aya-expanse-8b [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088609 (https://phabricator.wikimedia.org/T379052) [12:04:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:07:30] (03Merged) 10jenkins-bot: Upgrade wikimedia/relpath from 4.0.0 to 4.0.1 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090456 (https://phabricator.wikimedia.org/T379480) (owner: 10Hashar) [12:08:34] (03PS1) 10Slyngshede: Filter out none posixGroup "group" in next_gid_number. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1090459 [12:08:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:08:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:08:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1010.eqiad.wmnet [12:09:01] !log remove ganeti1015 from active ganeti nodes T378921 [12:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:05] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [12:10:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311924 (10MoritzMuehlenhoff) [12:12:11] PROBLEM - ganeti-noded running on ganeti1015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:12:11] PROBLEM - ganeti-confd running on ganeti1015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:12:45] (03PS1) 10Muehlenhoff: Remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1090460 (https://phabricator.wikimedia.org/T379612) [12:14:08] FIRING: ProbeDown: Service ganeti1015:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:26] (03CR) 10Muehlenhoff: [C:03+2] Remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1090460 (https://phabricator.wikimedia.org/T379612) 
(owner: 10Muehlenhoff) [12:15:58] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission ganeti1010 / ganeti1013 - https://phabricator.wikimedia.org/T379612#10311936 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [12:21:10] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update aya model deployment to aya-expanse-8b [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088609 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos) [12:25:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [12:26:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311949 (10ops-monitoring-bot) Draining ganeti1012.eqiad.wmnet of running VMs [12:28:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2236 slowly with 10 steps - slow repool T373579 [12:28:16] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [12:28:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [12:31:57] (03PS1) 10Peter Fischer: CirrusSearch: enable offloading weighted tags via EventBus for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090462 (https://phabricator.wikimedia.org/T378983) [12:34:50] (03CR) 10Peter Fischer: [C:03+1] [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [12:35:14] (03Abandoned) 10Peter Fischer: CirrusSearch: enable offloading weighted tags via EventBus for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090462 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [12:35:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of 
ml-etcd1002.eqiad.wmnet to drbd [12:35:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10311979 (10ops-monitoring-bot) VM ml-etcd1002.eqiad.wmnet switching disk type to drbd [12:36:09] (03PS6) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [12:37:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10311981 (10cmooney) @Jgreen @Dwisehaupt I think we have broadly two options for how to proceed today: **Option 1:** Begin wit... [12:40:30] (03CR) 10Stevemunene: airflow: add airflow-wmde files (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:40:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1089702 (owner: 10Slyngshede) [12:41:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090464 (https://phabricator.wikimedia.org/T375662) [12:41:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090464 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [12:42:46] (03PS2) 10Muehlenhoff: cache::kafka::certificate: Remove $use_internal_ca [puppet] - 10https://gerrit.wikimedia.org/r/1088296 (https://phabricator.wikimedia.org/T337825) [12:44:22] (03PS1) 10Slyngshede: C:ldap::management: members -> member [puppet] - 10https://gerrit.wikimedia.org/r/1090465 [12:44:46] (03CR) 10Michael Große: [C:03+1] [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 
(https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [12:45:23] (03CR) 10Muehlenhoff: [C:03+2] cache::kafka::certificate: Remove $use_internal_ca [puppet] - 10https://gerrit.wikimedia.org/r/1088296 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [12:45:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1002.eqiad.wmnet to drbd [12:45:27] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:01] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [12:46:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:03] RESOLVED: ProbeDown: Service ganeti1015:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [12:52:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312017 (10ops-monitoring-bot) Draining ganeti1012.eqiad.wmnet of running VMs [12:53:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [12:53:34] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.changedisk for changing disk type of ml-etcd1002.eqiad.wmnet to plain [12:54:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312019 (10ops-monitoring-bot) VM ml-etcd1002.eqiad.wmnet switching disk type to plain [12:54:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1002.eqiad.wmnet to plain [12:58:33] (03CR) 10Brouberol: airflow: add airflow-wmde files (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:59:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to drbd [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1300) [13:00:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312022 (10ops-monitoring-bot) VM dse-k8s-etcd1003.eqiad.wmnet switching disk type to drbd [13:01:03] (03PS1) 10JMeybohm: etcd::v3: Ensure etcd peers srange is sorted [puppet] - 10https://gerrit.wikimedia.org/r/1090467 [13:01:15] (03PS1) 10Brouberol: airflow: upgrade image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090468 (https://phabricator.wikimedia.org/T377928) [13:02:11] (03CR) 10Gmodena: [C:03+1] "LGTM, but let's check with other flink users." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:02:39] (03CR) 10JMeybohm: [C:03+2] preseed: Migrate wikikube-ctrl1* to containerd partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1090432 (https://phabricator.wikimedia.org/T377876) (owner: 10JMeybohm) [13:02:49] (03CR) 10JMeybohm: [C:03+2] wikikube-staging: Remove obsolete docker hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1090433 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [13:06:01] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090467 (owner: 10JMeybohm) [13:08:08] (03CR) 10Brouberol: [C:03+2] airflow: upgrade image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090468 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [13:09:13] (03PS1) 10Slyngshede: P:idm-test: allow account manager permission to be requested. [puppet] - 10https://gerrit.wikimedia.org/r/1090469 [13:09:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:09:33] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration [13:09:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to drbd [13:09:44] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:20] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [13:10:22] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [13:10:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:11:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [13:11:24] 
06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312042 (10ops-monitoring-bot) Draining ganeti1012.eqiad.wmnet of running VMs [13:11:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [13:12:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:44] this is me [13:13:30] Amir1,slyngs: FYI I'm reimaging k8s control planes of the wikikube-eqiad cluster [13:13:44] Noted [13:14:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to plain [13:15:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312066 (10ops-monitoring-bot) VM dse-k8s-etcd1003.eqiad.wmnet switching disk type to plain [13:15:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1003.eqiad.wmnet to plain [13:19:11] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090464 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [13:19:30] (03CR) 10Btullis: [C:03+1] spark: Avoid Ferm-specific syntax (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1087488 (owner: 10Muehlenhoff) [13:21:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [13:21:33] 06SRE, 
10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312100 (10ops-monitoring-bot) Draining ganeti1012.eqiad.wmnet of running VMs [13:26:04] (03PS1) 10Brouberol: global_config: fix the role name behind analytics-test-hive [puppet] - 10https://gerrit.wikimedia.org/r/1090472 (https://phabricator.wikimedia.org/T379363) [13:28:19] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4495/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090472 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [13:28:38] hi, train presync failed last night, I'm going to run it in a few minutes if there are no objections [13:29:13] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622 (10JMeybohm) 03NEW [13:31:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1090469 (owner: 10Slyngshede) [13:32:06] (03PS3) 10Brouberol: ceph-csi-rbd: grant the ClusterRole permissions to list/watch secrets in all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090454 (https://phabricator.wikimedia.org/T379601) [13:35:23] (03CR) 10Slyngshede: [C:03+2] P:idm-test: allow account manager permission to be requested. 
[puppet] - 10https://gerrit.wikimedia.org/r/1090469 (owner: 10Slyngshede) [13:36:18] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090474 (https://phabricator.wikimedia.org/T375662) [13:36:19] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090474 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [13:36:38] (03PS4) 10Brouberol: ceph-csi-rbd: grant the ClusterRole permissions to list/watch secrets in all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090454 (https://phabricator.wikimedia.org/T379601) [13:37:05] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090474 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [13:37:31] !log jnuche@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 [13:37:35] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [13:43:34] !log jnuche@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 [13:43:38] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [13:44:13] (03CR) 10Brouberol: [C:03+2] ceph-csi-rbd: grant the ClusterRole permissions to list/watch secrets in all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090454 (https://phabricator.wikimedia.org/T379601) (owner: 10Brouberol) [13:44:54] (03CR) 10Muehlenhoff: C:ldap::management: members -> member (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090465 (owner: 10Slyngshede) [13:46:27] (03PS7) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [13:46:58] (03CR) 10Urbanecm: [C:03+1] growthexperiments.pp: track dangling records for cswiki hourly [puppet] - 10https://gerrit.wikimedia.org/r/1090449 
(https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [13:47:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:47:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10312229 (10MoritzMuehlenhoff) [13:48:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:49:15] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10312242 (10MoritzMuehlenhoff) Will Supermicro send a replacement CPU for this server? [13:49:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:31] (03PS2) 10Urbanecm: [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 (https://phabricator.wikimedia.org/T378983) [13:56:39] (03CR) 10Urbanecm: [C:03+2] [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [13:57:03] (03CR) 10AOkoth: [C:03+2] Correct range of A-z [puppet] - 10https://gerrit.wikimedia.org/r/1089077 (https://phabricator.wikimedia.org/T362829) (owner: 10TheDJ) [13:57:19] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1090472 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [13:57:23] (03Merged) 10jenkins-bot: [CirrusSearch] testwiki: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090455 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [13:58:53] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1090455|[CirrusSearch] testwiki: enable offloading weighted tags via EventBus (T378983)]] [13:58:58] (03CR) 10Jelto: [C:03+1] "lgtm. That confused me in the previous pcc as well." [puppet] - 10https://gerrit.wikimedia.org/r/1090467 (owner: 10JMeybohm) [13:58:59] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [13:59:42] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: fix the role name behind analytics-test-hive [puppet] - 10https://gerrit.wikimedia.org/r/1090472 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1400). [14:00:05] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] hey [14:00:16] tgr|away: hi, how are you? [14:01:10] hi urbanecm! [14:01:20] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628 (10jnuche) 03NEW [14:01:38] tgr|away: i'm shipping a config change, i can +2 your backport and then leave you to it? [14:01:41] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312343 (10jnuche) p:05Triage→03Unbreak! [14:01:50] okay, i can't... [14:01:52] because ^^ [14:01:59] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312346 (10jnuche) [14:02:31] does that affect backports? [14:02:36] tgr|away: i just got that error [14:02:43] but i'm rerunning just in case [14:02:56] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1090455|[CirrusSearch] testwiki: enable offloading weighted tags via EventBus (T378983)]] [14:03:23] maybe it is not deterministic? 
[14:03:26] no, it is [14:03:39] this is what's on my screen https://www.irccloud.com/pastebin/oivOHQe9/ [14:04:02] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629 (10JMeybohm) 03NEW [14:04:23] well, that settles the window i guess [14:05:15] running it by hand seems to work fine for me [14:05:19] eval.php I mean [14:05:25] yeah [14:05:26] same [14:06:38] hi, I ran into that error earlier today, I've filed: https://phabricator.wikimedia.org/T379628 [14:06:45] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10312375 (10JMeybohm) [14:07:02] ah, you already saw it :) [14:08:03] jnuche: yeah, on my term unfortunately [14:08:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:08:12] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312371 (10Urbanecm_WMF) This also started to affect backports: ` 14:02:56 Started scap sync-world: Backport for [[gerrit:1090455|[CirrusSearch] t... [14:08:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:09:48] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312388 (10jnuche) Script works outside of the container apparently: > Gergő Tisza running it by hand seems to work fine for me > 3:05... [14:10:17] tgr|away: i presume you used `mwscript` (as opposed to `scap mwscript`, which scap does)? 
[14:10:17] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10312400 (10RobH) **Vital Date Update** I failed to get this filed before I went away for a week, and now its too short notice to get it filed today. I'... [14:11:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for HArroyo-WMF - https://phabricator.wikimedia.org/T379630 (10hector.arroyo) 03NEW [14:12:29] (03PS2) 10Brouberol: airflow: remove fsGroup stanzas as all containers are running with the same uid/gid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088543 (https://phabricator.wikimedia.org/T379265) [14:12:54] yeah [14:12:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10312403 (10akosiaris) This is one host that is past the 5 year old mark for what is worth. It use... [14:12:58] (03CR) 10Ottomata: "Nit: anywhere we override the defaults, it would be nice to add a comment explaining why." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [14:13:00] what's the difference though? [14:13:06] (03CR) 10Ottomata: [C:03+1] "+1 otherwise!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [14:13:07] latter is containerized AFAIK [14:13:19] logstash says the actual command was 'multiversion/MWScript.php eval.php --wiki=testcommonswiki' [14:13:30] I tried mwscript-k8s, works as well [14:13:46] different container I guess? 
[14:14:08] possibly [14:14:09] (03CR) 10DCausse: wdqs: remove 5 codfw hosts from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [14:14:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1090459 (owner: 10Slyngshede) [14:16:40] /query jnuche [14:16:49] sorry totally wrong commands jnuche [14:16:52] but hi :) [14:16:54] hehe [14:17:00] hi there:D [14:17:00] tgr|away: running the full actual command indeed fails [14:17:54] I run into a similar issue yesterday, although in a different scap step: T379589 [14:17:54] T379589: scap backport fails at purgeMessageBlobStore.php with getaddrinfo failed - https://phabricator.wikimedia.org/T379589 [14:18:20] urbanecm: what's the full command you used? [14:18:29] that seemed like loading this fake config instead of the real one: [14:18:30] jnuche: `sudo -u mwbuilder -n -- /usr/bin/scap mwscript --no-local-config --directory /srv/mediawiki-staging --user www-data --network -- eval.php --wiki=testcommonswiki` [14:18:30] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/5556811bd79e5ddce8623c17f111e7409e0b510a/wmf-config/CommonSettings.php#170 [14:18:38] ty [14:19:12] so maybe that env flag is set for some reason? [14:19:42] https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/mwscript.py#L245 claims so [14:20:37] (03CR) 10Stevemunene: airflow: add airflow-wmde files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [14:21:15] (03CR) 10Ssingh: [C:03+1] "Looks good and since you have verified it works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [14:21:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:15] (03Abandoned) 10Brouberol: Fix typos in analytics-hadoop-test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088606 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [14:22:42] (03CR) 10DCausse: wdqs: create wdqs-internal-[main,scholarly] roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [14:23:13] I got that error on the third deployment though, after two going through just fine. [14:23:33] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:26:32] !log installing apache2 security updates [14:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:05] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002" [14:28:09] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002" [14:28:09] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:40] (03CR) 10Btullis: [V:03+1 C:03+2] Update druid test config to drop unused segments automatically [puppet] - 10https://gerrit.wikimedia.org/r/1077653 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [14:29:47] (03CR) 10Brouberol: [C:03+1] ":shipit:!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [14:30:39] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on fasw-c-eqiad with reason: fundraising tech migration to new equipment [14:30:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on fasw-c-eqiad with reason: fundraising tech migration to new equipment [14:31:44] jnuche: note i have an undeployed patch scap merged [14:31:44] (03CR) 10Vgutierrez: [C:04-1] "ExecPaths and NoExecPaths aren't valid options on bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [14:31:48] do you want me to revert it? [14:31:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10312547 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fd1b13c3-25ae-42de-a138-bb1a3989c0b4) set by cmoon... [14:32:28] (03CR) 10Btullis: [C:03+1] airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [14:32:29] urbanecm: yeah, would probably be better to keep things clean, thx [14:33:26] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312530 (10Tgr) See also {T379589} which seems to have the same cause (using a mock DB config for offline operations) but occurred at a later scap... 
[14:34:37] (03PS1) 10Urbanecm: Revert "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090480 (https://phabricator.wikimedia.org/T378983) [14:34:43] (03CR) 10Urbanecm: [V:03+2 C:03+2] Revert "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090480 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [14:34:55] done [14:34:59] (03CR) 10Ssingh: [C:03+1] haproxykafka: systemd service hardening (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [14:36:29] (03PS1) 10Herron: admin: add ldap_only entry for harroyo-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1090481 (https://phabricator.wikimedia.org/T379630) [14:36:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:18] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry_ha: allow /v2/_catalog only for internal clients [puppet] - 10https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [14:39:57] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove irc1002/irc2002 from wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089752 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [14:40:03] (03PS2) 10TChin: flink-app: Add default checkpointing config for Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) [14:40:42] (03Merged) 10jenkins-bot: Remove irc1002/irc2002 from wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089752 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) 
[14:46:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:50] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1090437 (owner: 10Muehlenhoff) [14:51:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:30] jnuche: am I clear to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1089752 ? [14:51:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:15] akosiaris: sry, scap deployments are currently blocked by T379628 [14:53:15] T379628: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628 [14:53:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:12] jnuche: ok, thanks for letting me know! [14:54:33] 06SRE, 06serviceops, 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312636 (10CDanis) @bvibber @aude @Jdlrobson @CCiufo-WMF @Seddon FYI [14:56:54] (03CR) 10Herron: [C:03+1] "Nice one, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [14:57:06] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for khantstop - https://phabricator.wikimedia.org/T379409#10312638 (10herron) 05Open→03Resolved a:03herron uid=khantstop has been added to ldap group wmf [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [15:03:59] (03CR) 10DCausse: flink-app: Add default checkpointing config for Flink 1.20 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [15:03:59] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1090481 (https://phabricator.wikimedia.org/T379630) (owner: 10Herron) [15:04:08] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cr[1-2]-eqiad,pfw3-eqiad with reason: fundraising tech migration to new equipment [15:04:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cr[1-2]-eqiad,pfw3-eqiad with reason: fundraising tech migration to new equipment [15:04:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10312686 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6d3e8237-b81b-47ec-a63c-afd9f7859ae7) set by cmoon... 
[15:06:25] (03CR) 10Slyngshede: C:ldap::management: members -> member (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090465 (owner: 10Slyngshede) [15:06:52] (03CR) 10Slyngshede: [C:03+2] P:idp type check Redis keys before accessing [puppet] - 10https://gerrit.wikimedia.org/r/1089702 (owner: 10Slyngshede) [15:09:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10312707 (10JMeybohm) 05Open→03Resolved a:03JMeybohm This was fixed by removing the internal NIC(s) as well as the unused port of the 10G NIC fro... [15:09:59] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [15:11:51] (03CR) 10Hnowlan: "Currently, when we see a surge or a rebalancing of traffic, we frequently see queues piling up before we start serving 5xx errors. This pa" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089832 (https://phabricator.wikimedia.org/T379561) (owner: 10Hnowlan) [15:12:33] FIRING: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:13:25] !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=99) Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration [15:14:50] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration [15:16:23] !log moving fundraising links in eqiad from old to new firewall cluster and switches (T377381) [15:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] T377381: Frack eqiad network 
upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381 [15:16:38] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1002.eqiad.wmnet [15:16:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1002.eqiad.wmnet [15:17:34] (03PS1) 10Fabfur: haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) [15:18:10] (03CR) 10CI reject: [V:04-1] haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:19:14] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [15:19:28] (03CR) 10Vgutierrez: [C:03+1] "please get this merged ASAP to unbreak puppet on deployment-prep cp instances" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:19:44] (03PS2) 10Fabfur: haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) [15:23:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:25:14] (03PS3) 10Fabfur: haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) [15:26:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:26:34] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64700/IPv4: Idle - frack-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:57] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:35:24] !log cmooney@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002" [15:36:19] (03PS2) 10Scott French: changeprop-jobqueue: set max poll interval and revert concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087558 (https://phabricator.wikimedia.org/T356241) [15:36:30] (03CR) 10Herron: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1090481 (https://phabricator.wikimedia.org/T379630) (owner: 10Herron) [15:38:32] (03PS1) 10Hamish: Revert "cswiki: Add celebration logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090493 [15:39:12] (03PS1) 10Brouberol: airflow-analytics-test: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090494 (https://phabricator.wikimedia.org/T379136) [15:39:12] (03PS1) 10Brouberol: airflow-analytics-product: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090495 (https://phabricator.wikimedia.org/T379136) [15:39:13] (03PS1) 10Brouberol: airflow-platform-eng: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090496 (https://phabricator.wikimedia.org/T379136) [15:39:14] (03PS1) 10Brouberol: airflow-research: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090497 (https://phabricator.wikimedia.org/T379136) [15:39:14] (03PS1) 10Brouberol: airflow-search: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090498 (https://phabricator.wikimedia.org/T379136) [15:39:15] (03PS1) 10Brouberol: airflow-wmde: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090499 (https://phabricator.wikimedia.org/T379136) [15:39:19] (03PS1) 10Brouberol: airflow-analytics: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090500 (https://phabricator.wikimedia.org/T379136) [15:39:23] (03PS1) 10Brouberol: airflow: set airflow default value to 2.10.3 [puppet] - 
10https://gerrit.wikimedia.org/r/1090501 (https://phabricator.wikimedia.org/T379136) [15:39:27] (03PS2) 10Hamish: Revert "cswiki: Add celebration logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090493 (https://phabricator.wikimedia.org/T379613) [15:40:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090493 (https://phabricator.wikimedia.org/T379613) (owner: 10Hamish) [15:40:15] (03CR) 10Fabfur: [C:03+2] hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:40:26] (03PS1) 10Srishakatux: Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) [15:41:36] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002" [15:42:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:45] (03CR) 10Stevemunene: [C:03+2] airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [15:43:20] (03CR) 10Ssingh: [C:03+1] haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:43:35] (03CR) 10Cathal Mooney: [C:03+2] Remove pfw3-eqiad and replace with pfw1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088537 
(https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [15:45:40] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: set max poll interval and revert concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087558 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [15:46:06] (03CR) 10Fabfur: [C:03+2] haproxykafka: remove group dependency [puppet] - 10https://gerrit.wikimedia.org/r/1090489 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:46:11] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-test: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090494 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:46:12] (03Merged) 10jenkins-bot: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [15:46:28] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-product: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090495 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:46:39] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090496 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:46:52] (03CR) 10Stevemunene: [C:03+1] airflow-research: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090497 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:47:06] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for HArroyo-WMF - https://phabricator.wikimedia.org/T379630#10312930 (10herron) 05Open→03Resolved a:03herron membership to ldap group `wmf` has been provisioned, thanks! 
[15:47:07] (03CR) 10Stevemunene: [C:03+1] airflow-search: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090498 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:47:19] (03CR) 10Stevemunene: [C:03+1] airflow-wmde: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090499 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:47:30] (03CR) 10Stevemunene: [C:03+1] airflow-analytics: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090500 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:47:46] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [15:47:57] (03PS2) 10Volans: Drop Python support for 3.7, 3.8, add 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 [15:47:57] (03PS2) 10Volans: Use importlib.metadata instead of pkg_resources [software/cumin] - 10https://gerrit.wikimedia.org/r/1029210 [15:47:57] (03PS1) 10Volans: Add support for Python 3.12 [software/cumin] - 10https://gerrit.wikimedia.org/r/1090504 [15:47:57] (03PS1) 10Volans: Integration tests: use linuxserver/openssh-server [software/cumin] - 10https://gerrit.wikimedia.org/r/1090505 [15:48:05] (03CR) 10Stevemunene: [C:03+1] airflow: set airflow default value to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090501 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:48:18] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090494 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:48:25] (03CR) 10Brouberol: [C:03+2] airflow-analytics-product: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090495 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [15:48:34] (03CR) 10Volans: [C:03+2] mysql_legacy: improve pymysql usability [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1087852 (owner: 10Volans) [15:48:51] (03CR) 10Volans: [C:03+2] mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 (owner: 10Volans) [15:49:03] (03Merged) 10jenkins-bot: changeprop-jobqueue: set max poll interval and revert concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087558 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [15:49:14] (03CR) 10Volans: [C:03+2] mysql_legacy: add MysqlClient class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [15:49:35] (03CR) 10Scott French: "Thanks, all! I'll aim to backport this during the UTC-late infra window as long as scap is working again (https://phabricator.wikimedia.or" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [15:52:05] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:52:30] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:53:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:56:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:37] (03PS1) 10Fabfur: haproxykafka: removing unused group resource [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) [15:56:46] (03CR) 10Ryan Kemper: wdqs: create 
wdqs-internal-[main,scholarly] roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [15:56:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:57:03] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [15:57:20] (03CR) 10CI reject: [V:04-1] haproxykafka: removing unused group resource [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:58:44] (03PS2) 10Fabfur: haproxykafka: removing unused group resource [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) [16:00:06] eoghan, jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1600). [16:00:35] (03Merged) 10jenkins-bot: mysql_legacy: improve pymysql usability [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 (owner: 10Volans) [16:00:36] (03Merged) 10jenkins-bot: mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 (owner: 10Volans) [16:00:36] (03Merged) 10jenkins-bot: mysql_legacy: add MysqlClient class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [16:01:31] (03CR) 10Ssingh: [C:03+1] "[Not very sure but let's try it.]" [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:02:11] (03CR) 10Fabfur: [C:03+2] haproxykafka: removing unused group resource [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:02:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) 
[16:04:08] (03CR) 10CI reject: [V:04-1] Drop Python support for 3.7, 3.8, add 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [16:06:08] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090506 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:06:28] (03CR) 10Volans: "The mypy CI failure is fixed in the next commit of the chain" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [16:07:42] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:08:21] (03PS1) 10Jgiannelos: push-notifications: Bump imaget to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090509 [16:08:33] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:08:56] 06SRE, 06serviceops, 10Release-Engineering-Team (Radar), 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313092 (10brennen) [16:09:18] (03PS2) 10Jgiannelos: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090509 [16:09:44] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Fix login redirection to preserve dashboard context [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [16:12:05] (03CR) 10Jgiannelos: [C:03+2] push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090509 (owner: 10Jgiannelos) [16:12:30] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090496 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:13:09] (03Merged) 10jenkins-bot: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090509 (owner: 
10Jgiannelos) [16:13:39] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-eqiad [16:13:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-eqiad [16:15:30] (03PS4) 10Fabfur: haproxykafka: systemd service hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) [16:15:39] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [16:15:45] (03CR) 10Fabfur: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [16:16:06] (03CR) 10CI reject: [V:04-1] haproxykafka: systemd service hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: 10Fabfur) [16:16:34] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [16:16:41] (03PS5) 10Fabfur: haproxykafka: systemd service hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) [16:17:02] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [16:17:38] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [16:18:14] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [16:18:56] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [16:23:42] (03CR) 10Brouberol: [C:03+2] airflow-research: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090497 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:23:55] (03PS1) 10Dreamy Jazz: Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) [16:23:58] 10ops-eqiad, 06SRE, 06DC-Ops, 
10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10313129 (10cmooney) Migration work is now complete, bastion and all hosts are reachable again following the moves. BGP is est... [16:26:57] (03PS6) 10Fabfur: haproxykafka: systemd service hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) [16:27:49] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:29:51] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:30:36] !log dancy@deploy2002 Installing scap version "4.123.0" for 209 hosts [16:30:45] (03CR) 10Brouberol: [C:03+2] airflow-search: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090498 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:34:58] !log dancy@deploy2002 Installation of scap version "4.123.0" completed for 209 hosts [16:37:25] jouncebot nowandnext [16:37:25] For the next 0 hour(s) and 22 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1600) [16:37:25] In 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1700) 
[16:37:49] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:37:52] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [16:38:25] (03CR) 10Brouberol: [C:03+2] airflow-wmde: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090499 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:38:49] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:39:11] (03PS1) 10Tchanders: Disallow AbuseFilter protected variables use on non-temp-user wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090515 (https://phabricator.wikimedia.org/T379503) [16:39:14] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:40:31] !log brennen@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 [16:40:37] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [16:41:46] 06SRE, 06serviceops, 10Release-Engineering-Team (Radar), 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313250 (10dancy) scap 4.123.0 has been 
deployed which should address this problem. [16:43:49] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:44:22] (03CR) 10Brouberol: [C:03+2] airflow-analytics: upgrade airflow to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090500 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:44:49] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:45:24] (03PS1) 10Jgiannelos: Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090516 [16:45:52] (03PS2) 10Brouberol: airflow: set airflow default value to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090501 (https://phabricator.wikimedia.org/T379136) [16:46:16] (03CR) 10Dreamy Jazz: [C:03+1] Disallow AbuseFilter protected variables use on non-temp-user wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090515 (https://phabricator.wikimedia.org/T379503) (owner: 10Tchanders) [16:47:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:48:24] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [16:50:19] We tried deploying the latest version of 
push-notifications that updates firebase sdk and we end up encountering timeout error on outgoing requests to firebase: https://phabricator.wikimedia.org/T379647 [16:51:32] I am reverting as we speak but we can't reproduce it in our local env. I suspect that there must be something wrong with the http proxy that we use for outgoing requests and ipv6 [16:51:40] (03CR) 10Jgiannelos: [C:03+2] Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090516 (owner: 10Jgiannelos) [16:51:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090518 [16:51:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090518 (owner: 10TrainBranchBot) [16:52:06] (03CR) 10Brouberol: [C:03+2] airflow: set airflow default value to 2.10.3 [puppet] - 10https://gerrit.wikimedia.org/r/1090501 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [16:52:47] (03Merged) 10jenkins-bot: Revert "push-notifications: Bump image to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090516 (owner: 10Jgiannelos) [16:53:30] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [16:54:15] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [16:54:35] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [16:54:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:57] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [16:56:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on 
mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:51] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10313449 (10Jhancock.wm) Working on getting the process ironed out. I'll let you know as soon as i have an update. for now i did put the other CPU back in and rotated them.... [17:00:04] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1700). Please do the needful. [17:00:04] MichaelG_WMF: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:13] Hi o/ [17:00:24] 👋 [17:00:39] running PCC just in case I missed something, then I'll go ahead and merge [17:00:49] want a manual run, or happy to just let the next one happen on schedule? 
[17:00:52] Thank you :) [17:01:21] just letting the next one run on schedule is completely fine 👍 [17:01:27] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4496/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090449 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [17:02:00] (03CR) 10RLazarus: [V:03+1 C:03+2] growthexperiments.pp: track dangling records for cswiki hourly [puppet] - 10https://gerrit.wikimedia.org/r/1090449 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [17:02:31] MichaelG_WMF: sgtm! puppet window complete, then :) [17:02:48] thanks :D [17:05:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10313473 (10JMeybohm) 05Resolved→03Open Unfortunately this worked only once. Now the PXE boot hangs right after "All rights reserved." with no further...
[17:06:49] (03PS2) 10Scott French: changeprop: add latency_sensitive_jobs_config (jobqueue) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) [17:06:54] (03PS3) 10Scott French: changeprop-jobqueue: add AssembleUploadChunks rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) [17:06:56] (03PS3) 10Scott French: changeprop-jobqueue: enable AssembleUploadChunks rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089315 (https://phabricator.wikimedia.org/T379035) [17:21:11] (03PS1) 10Scott French: changeprop-jobqueue: double concurrency for transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090519 (https://phabricator.wikimedia.org/T356241) [17:23:22] (03PS1) 10Raymond Ndibe: profile::manifests::toolforge::bastion: add harbor url to /etc/toolforge/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) [17:24:54] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: double concurrency for transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090519 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:25:38] (03CR) 10CI reject: [V:04-1] profile::manifests::toolforge::bastion: add harbor url to /etc/toolforge/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe) [17:26:01] !log brennen@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 (duration: 45m 29s) [17:26:04] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [17:28:36] 06SRE, 06serviceops, 10Release-Engineering-Team (Radar), 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313587 (10brennen) a:05brennen→03dancy [17:30:35] (03Merged) 10jenkins-bot: Branch commit for 
wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090518 (owner: 10TrainBranchBot) [17:30:45] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: double concurrency for transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090519 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:32:07] (03Merged) 10jenkins-bot: changeprop-jobqueue: double concurrency for transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090519 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:32:22] 06SRE, 06serviceops, 10Release-Engineering-Team (Radar), 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313583 (10brennen) 05Open→03Resolved a:03brennen > scap 4.123.0 has been deployed which should add... [17:34:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:35:53] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:39:55] (03CR) 10BCornwall: [C:03+1] Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [17:44:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10313650 (10JMeybohm) I've asked @VRiley-WMF / @Jclark-ctr on IRC if they could switch the cable from Slot 2 to Slot 1 (our default) to maybe convince the... [17:49:14] sukhe, urandom: FYI We're currently down 1 control-plane in wikikube-eqiad because of hardware failure(s) (so 2 out of 3 control-planes and etcd nodes are working). It's a "working but suboptimal" state, especially for etcd [17:49:26] https://phabricator.wikimedia.org/T379629 [17:50:22] jayme: noted and thanks!
[17:51:06] 06SRE-OnFire, 10Incident Tooling: corto: have CI build the Debian package - https://phabricator.wikimedia.org/T379305#10313695 (10BCornwall) This has already been implemented (starting from comment [[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/7b9ced9c53522b4fa2c667b98bed42e1a11bed17 | 7b9ced9c ]]). :) [17:54:09] 06SRE-OnFire, 10Incident Tooling: corto: have CI build the Debian package - https://phabricator.wikimedia.org/T379305#10313705 (10BCornwall) →14Duplicate dup:03T370788 [17:55:48] 06SRE-OnFire, 10Incident Tooling: corto: CI & packaging - https://phabricator.wikimedia.org/T370788#10313707 (10BCornwall) [17:56:49] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [17:57:01] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [17:57:54] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:58:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:01] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:00:04] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1800). 
[18:00:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:01:16] here, and attempting to check / confirm a couple of things before proceeding [18:01:43] !log remove ganeti1012 from active ganeti nodes T378921 [18:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:48] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [18:02:17] 06SRE, 06serviceops, 10Wikimedia-Site-requests, 07Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#10313765 (10MaryMunyoki) [18:03:18] 06SRE, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10313778 (10MaryMunyoki) [18:04:19] PROBLEM - ganeti-noded running on ganeti1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [18:04:19] PROBLEM - ganeti-confd running on ganeti1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [18:04:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10313850 (10MoritzMuehlenhoff) [18:05:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088770 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [18:06:06] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088771 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [18:07:03] FIRING: ProbeDown: Service ganeti1012:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:44] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [18:08:59] (03CR) 10Muehlenhoff: C:ldap::management: members -> member (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090465 (owner: 10Slyngshede) [18:10:55] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10313867 (10MoritzMuehlenhoff) Thanks for the update, there is no hurry, since we still have the old server(s), which ganeti2042 would eventually replace. I was just curiou... 
[18:11:11] starting now [18:11:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [18:12:13] (03Merged) 10jenkins-bot: Add title-case mapping to support migration to PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [18:12:51] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]] [18:12:54] T372603: Regenerate UcfirstOverrides.php for PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T372603 [18:19:09] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:19:16] T372603: Regenerate UcfirstOverrides.php for PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T372603 [18:21:42] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10313964 (10Jhancock.wm) 05Open→03Resolved [18:21:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10313973 (10Jhancock.wm) [18:22:03] RESOLVED: ProbeDown: Service ganeti1012:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10313976 (10Jhancock.wm) A3 was full. 
I racked that one in A2 and updated the task description [18:24:06] (03CR) 10Muehlenhoff: "The code looks good (sans a few typos inline), but I think we're missing a check whether the ppolicy is configured? The overlay is optiona" [software/bitu] - 10https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) (owner: 10Slyngshede) [18:24:38] !log verified consistent 7.4-like title-case behavior in 7.4- and 8.1-based images, verified expected treatment of eszett in mwdebug - T372603 [18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:42] T372603: Regenerate UcfirstOverrides.php for PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T372603 [18:25:00] !log swfrench@deploy2002 swfrench: Continuing with sync [18:27:51] (03PS1) 10Cathal Mooney: Swap order of if statements for vlan config [homer/public] - 10https://gerrit.wikimedia.org/r/1090525 (https://phabricator.wikimedia.org/T268802) [18:29:56] (03PS1) 10Hnowlan: TimedMediahandler: reenable shellbox-video for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090526 (https://phabricator.wikimedia.org/T356241) [18:30:20] (03CR) 10Cathal Mooney: [C:03+2] Swap order of if statements for vlan config [homer/public] - 10https://gerrit.wikimedia.org/r/1090525 (https://phabricator.wikimedia.org/T268802) (owner: 10Cathal Mooney) [18:30:36] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668 (10phaultfinder) 03NEW [18:30:56] (03Merged) 10jenkins-bot: Swap order of if statements for vlan config [homer/public] - 10https://gerrit.wikimedia.org/r/1090525 (https://phabricator.wikimedia.org/T268802) (owner: 10Cathal Mooney) [18:30:59] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:31:40] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]] (duration: 18m 48s) [18:31:47] T372603: 
Regenerate UcfirstOverrides.php for PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T372603 [18:33:33] FYI, I am done with the infra window [18:38:59] (03CR) 10Scott French: [C:03+1] TimedMediahandler: reenable shellbox-video for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090526 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [18:43:01] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [18:44:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:45:03] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:48:42] (03PS1) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [18:49:19] (03CR) 10CI reject: [V:04-1] [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [18:49:26] (03PS3) 10Cathal Mooney: Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) [18:49:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:53:37] (03PS1) 10Scott French: mwdebug-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085494 (https://phabricator.wikimedia.org/T372604) [18:53:40] (03PS1) 10Scott French: hieradata: switch mw-debug "next" to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1087983 (https://phabricator.wikimedia.org/T372604) [18:55:05] !log jayme@cumin2002 
START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [18:55:50] !log installing libarchive security updates [18:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1900). [19:00:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:01:11] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [19:01:25] o/ [19:02:07] (03PS2) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [19:02:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:45] (03CR) 10CI reject: [V:04-1] [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [19:04:11] PROBLEM - Juniper alarms on fasw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:06:24] !log 1.44.0-wmf.3 train status (T375662): no current blockers, rolling to group0. 
[19:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:28] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:06:43] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090530 (https://phabricator.wikimedia.org/T375662) [19:06:45] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090530 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:07:29] (03PS5) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) [19:07:49] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090530 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:11:05] (03PS3) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [19:12:28] (03CR) 10Cathal Mooney: "Uploaded new patch to change how the dicts are built. On the hmac algorithm I'm in two minds really so let me know. 
Validator makes sens" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [19:12:33] FIRING: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:13:04] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: host reimage [19:13:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10314144 (10JMeybohm) @Jclark-ctr reseated the cable into Slot 1 and while the link did not immediately show up via LED or iDRAC web-ui, it was shown as up... [19:14:00] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [19:14:42] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.3 refs T375662 [19:14:48] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:14:55] (03CR) 10Cathal Mooney: "If using AES in GCM mode you don't need to have a separate HMAC algorithm configured for authentication (and indeed the device won't let y" [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [19:16:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: host reimage [19:25:27] (03PS4) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [19:28:21] (03CR) 10Ebernhardson: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [19:36:34] (03CR) 10JHathaway: [C:03+1] docker_registry_ha: allow /v2/_catalog only for internal clients [puppet] - 10https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [19:40:32] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1001.eqiad.wmnet with OS bookworm [19:42:08] !log jayme@cumin2002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl1001.* [19:42:21] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1001.eqiad.wmnet [19:42:22] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1001.eqiad.wmnet [19:44:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10314250 (10phaultfinder) [19:45:23] sukhe, urandom: wikikube-eqiad control plane has been restored, you may cross that off [19:47:29] thanks jayme! 
[20:02:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:52] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10314310 (10phaultfinder) [20:16:12] (03PS5) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:23:12] (03PS6) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:28:59] (03PS7) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:33:06] (03PS8) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:36:17] (03PS9) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:36:53] jouncebot: nowandnext [20:36:53] For the next 0 hour(s) and 23 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T1900) [20:36:53] In 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T2100) [20:39:27] (03CR) 10Urbanecm: [C:03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090515 (https://phabricator.wikimedia.org/T379503) (owner: 10Tchanders) [20:41:59] (03PS10) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:42:18] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090493 (https://phabricator.wikimedia.org/T379613) (owner: 10Hamish) [20:42:35] (03PS1) 10Urbanecm: Revert^2 "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090550 (https://phabricator.wikimedia.org/T378983) [20:46:10] !log ebysans@deploy2002 Started deploy [analytics/refinery@113ea5a]: Regular analytics weekly train [analytics/refinery@113ea5ac] [20:47:25] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update aya model deployment to aya-expanse-8b [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088609 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos) [20:48:29] (03Merged) 10jenkins-bot: ml-services: update aya model deployment to aya-expanse-8b [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088609 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos) [20:49:19] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[20:49:36] (03PS11) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:53:47] !log ebysans@deploy2002 Finished deploy [analytics/refinery@113ea5a]: Regular analytics weekly train [analytics/refinery@113ea5ac] (duration: 07m 37s) [20:54:15] !log ebysans@deploy2002 Started deploy [analytics/refinery@113ea5a] (thin): Regular analytics weekly train THIN [analytics/refinery@113ea5ac] [20:55:29] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678 (10Dbrant) 03NEW [20:59:10] !log ebysans@deploy2002 Finished deploy [analytics/refinery@113ea5a] (thin): Regular analytics weekly train THIN [analytics/refinery@113ea5ac] (duration: 04m 54s) [20:59:56] !log ebysans@deploy2002 Started deploy [analytics/refinery@113ea5a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@113ea5ac] [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241112T2100). [21:00:05] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] i can deploy today [21:00:19] thanks! 
[21:00:29] (03CR) 10Urbanecm: [C:03+2] Fix warning about missing central account for temp users [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088770 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [21:00:29] (03CR) 10Urbanecm: [C:03+2] Check session provider when autocreating [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088771 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [21:00:33] (03CR) 10Urbanecm: [C:03+2] Revert^2 "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090550 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [21:00:36] second try :) [21:00:54] the two CentralAuth patches can go together [21:01:28] (03Merged) 10jenkins-bot: Revert^2 "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090550 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [21:02:06] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1090550|Revert^2 "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" (T378983)]] [21:02:09] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [21:04:05] !log ebysans@deploy2002 Finished deploy [analytics/refinery@113ea5a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@113ea5ac] (duration: 04m 09s) [21:05:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:22] (03PS12) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:09:24] !log 
urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090550|Revert^2 "[CirrusSearch] testwiki: enable offloading weighted tags via EventBus" (T378983)]] (duration: 07m 18s) [21:09:35] (03Merged) 10jenkins-bot: Fix warning about missing central account for temp users [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088770 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [21:10:12] (03Merged) 10jenkins-bot: Check session provider when autocreating [extensions/CentralAuth] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088771 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [21:10:29] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [21:11:02] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1088770|Fix warning about missing central account for temp users (T378289)]], [[gerrit:1088771|Check session provider when autocreating (T378289)]] [21:11:17] T378289: SUL accounts with unattached Wikitech accounts auto-creating unattached accounts on other wikis - https://phabricator.wikimedia.org/T378289 [21:11:49] (03PS1) 10Eevans: Relocate corto config to hieradata/role/common/alerting_host.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1090552 (https://phabricator.wikimedia.org/T379204) [21:11:50] (03PS13) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:13:12] (03CR) 10Eevans: [V:03+2 C:03+2] Relocate corto config to hieradata/role/common/alerting_host.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1090552 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [21:13:32] !log urbanecm@deploy2002 urbanecm, tgr: Backport for [[gerrit:1088770|Fix warning about missing central account for temp users (T378289)]], [[gerrit:1088771|Check session provider when autocreating (T378289)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:41] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4507/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [21:13:41] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [21:19:38] tgr|away: oh, it synced to debug already. OK from your side? [21:21:10] urbanecm: can't really test it, but it didn't break account autocreation, I'll call that good enough [21:22:29] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [21:22:33] sounds good [21:22:34] !log urbanecm@deploy2002 urbanecm, tgr: Continuing with sync [21:22:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10314571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [21:23:20] !log Deployed refinery using scap, then deployed onto hdfs [21:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:24] (03CR) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [21:25:14] !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@58d7b82]: (no justification provided) [21:27:13] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088770|Fix warning about missing central account for temp users (T378289)]], [[gerrit:1088771|Check session provider when autocreating (T378289)]] (duration: 16m 11s) [21:27:16] T378289: SUL accounts with unattached 
Wikitech accounts auto-creating unattached accounts on other wikis - https://phabricator.wikimedia.org/T378289 [21:27:28] !log deploying airflow as part of weekly deployment train [21:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:39] deployed :) [21:27:42] anything else? [21:27:43] tgr|away: or others [21:28:26] thanks urbanecm! [21:28:58] !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@58d7b82]: (no justification provided) (duration: 03m 50s) [21:42:56] 06SRE, 10Charts, 06Infrastructure-Foundations, 07Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939#10314680 (10CCiufo-WMF) @CDanis we can close this out right? [21:44:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:49:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:55:05] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [21:55:10] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [22:07:26] PROBLEM - Host ms-be2082 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:34] RECOVERY - Host ms-be2082 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [22:11:44] (03PS1) 10CDanis: docker-pkg: add upstream_version template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 [22:11:45] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [22:11:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10314728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... [22:12:09] (03CR) 10CDanis: "you can hate the implementation and the tests, but you can't hate the idea" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 (owner: 10CDanis) [22:35:15] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [22:35:29] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10314779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [22:36:08] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10314781 (10Andrew) Before I went on sabbatical I spent a while trying to decide if we can make wikitech-static into an actual static site. A static site would need much... 
[22:37:39] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10314787 (10Andrew) [23:08:54] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [23:11:34] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [23:28:08] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [23:28:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10315010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... [23:34:43] (03PS1) 10Scott French: changeprop-jobqueue: increase webVideoTranscode concurrency to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090567 (https://phabricator.wikimedia.org/T356241) [23:38:09] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090570 [23:38:28] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090570 (owner: 10Ahmon Dancy) [23:39:12] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090570 (owner: 10Ahmon Dancy) [23:44:39] (03PS1) 10Ahmon Dancy: wikiversions-dev.json: Remove labtestwiki [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090571 [23:44:56] (03CR) 10Ahmon Dancy: [C:03+2] wikiversions-dev.json: Remove labtestwiki [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090571 (owner: 
10Ahmon Dancy) [23:45:37] (03Merged) 10jenkins-bot: wikiversions-dev.json: Remove labtestwiki [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090571 (owner: 10Ahmon Dancy) [23:52:35] (03PS1) 10BCornwall: varnish: Pin varnish/modules versions to prod [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) [23:54:40] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4508/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:55:19] (03PS1) 10Ahmon Dancy: wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090573 [23:55:38] (03CR) 10Ahmon Dancy: [C:03+2] wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090573 (owner: 10Ahmon Dancy) [23:56:43] (03Merged) 10jenkins-bot: wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090573 (owner: 10Ahmon Dancy)