[00:00:44] 06SRE-OnFire, 10Incident Tooling: Corto: ensure Phabricator tasks are created with correct default visibility & priority - https://phabricator.wikimedia.org/T376500#10291276 (10Eevans) >>! In T376500#10286143, @Aklapper wrote: > @corto could set view policy and edit policy to #acl_sre-team (`PHID-PROJ-fqb3bher... [00:01:58] 06SRE-OnFire, 10Incident Tooling: Corto: ensure Phabricator tasks are created with correct default visibility & priority - https://phabricator.wikimedia.org/T376500#10291278 (10Eevans) 05Open→03Resolved a:03Eevans [00:03:29] (03CR) 10RLazarus: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: 10RLazarus) [00:04:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:06:19] jouncebot: nowandnext [00:06:19] No deployments scheduled for the next 2 hour(s) and 53 minute(s) [00:06:19] In 2 hour(s) and 53 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0300) [00:07:05] quick rollout to resolve the helmfile diffs for a chart bump that only affects mw-script [00:08:09] !log rzl@deploy2002 Started scap sync-world: 1085506 [00:10:02] !log rzl@deploy2002 Finished scap sync-world: 1085506 (duration: 02m 50s) [00:10:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:11:14] (03CR) 10RLazarus: [C:03+2] "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230) (owner: 10RLazarus) [00:21:59] (03PS8) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [00:22:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10291293 (10Jhancock.wm) mc-gp2006 nic port isn't coming up. tried different DAC cables but not coming up. will try a different port to see if it's a switch issue in the morning. [00:32:50] RECOVERY - MD RAID on wikikube-worker2068 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:34:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087269 [00:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087269 (owner: 10TrainBranchBot) [01:01:09] (03PS2) 10Aude: Create releases for chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) [01:01:59] (03CR) 10Aude: Create releases for chart-renderer service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [01:03:54] RECOVERY - Host ganeti2042 is UP: 
PING WARNING - Packet loss = 75%, RTA = 30.38 ms [01:03:58] PROBLEM - SSH on ganeti2042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087273 [01:08:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087273 (owner: 10TrainBranchBot) [01:10:18] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087269 (owner: 10TrainBranchBot) [01:11:24] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10291366 (10Platonides) I have been testing this by sending emails to be held. **It is very clearly losing emails**. There is a portion in the h... 
[01:14:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [01:41:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087273 (owner: 10TrainBranchBot) [02:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.2 [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087276 (https://phabricator.wikimedia.org/T375661) [02:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.2 [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087276 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [02:37:37] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.2 [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087276 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0300) [03:02:37] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:00:05] Deploy window Automatic deployment of
MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0400) [04:01:43] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087290 (https://phabricator.wikimedia.org/T375661) [04:01:44] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087290 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [04:02:32] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087290 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [04:03:01] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 [04:03:04] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0500) [05:10:39] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.27 (duration: 10m 37s) [05:14:25] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:41:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:12:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:54] PROBLEM 
- Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:13:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:13:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0700) [07:00:05] marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0700). 
nyaa~ [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:39:36] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 11414 [07:39:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11414 [07:41:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:55:04] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 14593 [07:55:12] PROBLEM - Memcached on idp1004 is CRITICAL: connect to address 208.80.154.7 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [07:56:15] (03PS1) 10Fabfur: hiera: haproxykafka defaults to 2 workers [puppet] - 10https://gerrit.wikimedia.org/r/1087359 (https://phabricator.wikimedia.org/T374473) [07:56:31] idp1004 is just some alert spam, should recover shortly [07:56:40] memcached is unused there and I'm cleaning up things [07:58:36] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10291601 (10MoritzMuehlenhoff) >>! In T378358#10289446, @Jhancock.wm wrote: > removed CPU 2. gonna let it run for a little and see if it generates errors. then we'll at least... 
[07:58:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:58:47] (03CR) 10Vgutierrez: [C:03+1] hiera: haproxykafka defaults to 2 workers [puppet] - 10https://gerrit.wikimedia.org/r/1087359 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [07:58:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:39] (03CR) 10Fabfur: [C:03+2] hiera: haproxykafka defaults to 2 workers [puppet] - 10https://gerrit.wikimedia.org/r/1087359 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [08:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0800). [08:00:05] Tchanders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:03:17] Are we OK to go ahead with the deployment window now? 
I can deploy [08:03:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14593 [08:03:43] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 2828 [08:05:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:06:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2828 [08:08:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders) [08:09:43] (03Merged) 10jenkins-bot: temp accounts: Enable temp account creation on second-round pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders) [08:10:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:10:28] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1087195|temp accounts: Enable temp account creation on second-round pilots (T378336)]] [08:10:31] T378336: Temporary Accounts: Minor pilots - Nov 5 deploy - https://phabricator.wikimedia.org/T378336 [08:16:00] I'm getting backport failed: "'mwscript eval.php --wiki testwiki' generated unexpected 
output: Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234" [08:17:51] (03CR) 10Ayounsi: [C:03+2] Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [08:17:57] (03CR) 10CI reject: [V:04-1] Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [08:18:05] (03PS3) 10Ayounsi: Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 [08:19:12] RECOVERY - Memcached on idp1004 is OK: TCP OK - 0.000 second response time on 208.80.154.7 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [08:20:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:20:32] (03CR) 10Ayounsi: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [08:21:04] (03Merged) 10jenkins-bot: Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [08:24:02] Tchanders: it looks like that error already happened last night during the train presync, I'll see what I can dig up [08:24:21] !log uploaded ipip-multiqueue-optimizer 0.3+deb12u1 to apt.wm.o (bookworm) [08:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:26:27] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp1004 [dns] - 10https://gerrit.wikimedia.org/r/1087363 [08:27:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:29:55] jnuche: Thanks. Is it worth re-running in the meantime? [08:30:23] my guess is it will probably fail again [08:31:06] OK, I'll hold off [08:32:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:36:38] (03PS1) 10Fabfur: hiera: hpk batch_deadline on socket set to 1s [puppet] - 10https://gerrit.wikimedia.org/r/1087365 (https://phabricator.wikimedia.org/T374473) [08:37:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087365 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [08:39:09] (03CR) 10Vgutierrez: [C:03+1] hiera: hpk batch_deadline on socket set to 1s [puppet] - 10https://gerrit.wikimedia.org/r/1087365 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [08:39:12] jnuche: is there a task tracking the error? 
[08:40:07] (03CR) 10Fabfur: [C:03+2] hiera: hpk batch_deadline on socket set to 1s [puppet] - 10https://gerrit.wikimedia.org/r/1087365 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [08:40:13] kostajh: not yet, I'm putting together details to create one [08:41:11] it seems to me maybe the behavior of the mwscript "eval.php" has changed and now it needs network access, where before it didn't [08:41:57] kostajh: can I ping you in the story once it's ready? don't know if that sounds like something you're familiar with [08:44:15] (03PS1) 10Muehlenhoff: Disable installation of memcached on IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1087366 [08:44:52] jnuche: I could take a look at it, sure [08:44:56] jnuche: thanks for investigating. I'll abandon our deploy attempt for this window and try again next window [08:45:13] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087366 (owner: 10Muehlenhoff) [08:46:21] (03PS1) 10Fabfur: haproxykafka: restart service on config file changes [puppet] - 10https://gerrit.wikimedia.org/r/1087371 (https://phabricator.wikimedia.org/T374473) [08:47:26] jnuche: Since the patch got merged, should I run scap backport --revert to clean up? [08:48:34] Tchanders: yeah, probably that's a good idea, thx [08:48:45] (most likely will fail at the same place though) [08:49:10] (03PS1) 10Arnaudb: dbproxy: update grants with ip and fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) [08:49:10] (03CR) 10Arnaudb: "I suppose those templates are not applied automatically, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [08:50:04] (03PS1) 10TrainBranchBot: Revert "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087373 [08:50:04] (03CR) 10TrainBranchBot: "tchanders@deploy2002 created a revert of this change as I8585140e96e99c710bc591455b86e9d19cc8ad88" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087195 (https://phabricator.wikimedia.org/T378336) (owner: 10Tchanders) [08:50:44] (03CR) 10Muehlenhoff: [C:03+2] Disable installation of memcached on IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1087366 (owner: 10Muehlenhoff) [08:51:10] I guess this is related to work around T341560? [08:51:11] T341560: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 [08:52:48] (03PS1) 10Arnaudb: dbproxy: switch CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) [08:56:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087373 (owner: 10TrainBranchBot) [08:56:55] (03Merged) 10jenkins-bot: Revert "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087373 (owner: 10TrainBranchBot) [08:57:22] kostajh: mmh, not sure, the failing script is running on the deployment server directly, T341560 seems to be about mwmaint [08:57:22] T341560: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 [08:57:27] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1087373|Revert "temp accounts: Enable temp account creation on second-round pilots"]] [09:00:05] jnuche and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0900). nyaa~ [09:02:46] kostajh: T379044, my best guess is the script's behavior changed recently, but I'm gonna keep looking [09:02:47] T379044: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044 [09:05:03] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291704 (10kostajh) [09:06:34] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291711 (10kostajh) [09:06:34] (03PS1) 10Vgutierrez: liberica: fix hcforwarder configuration [puppet] - 10https://gerrit.wikimedia.org/r/1087376 (https://phabricator.wikimedia.org/T377127) [09:06:36] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291707 (10kostajh) p:05Triage→03Unbreak! Marking as UBN, as this is blocking train and backport deployments. [09:06:44] thanks jnuche. I added some tags and marked as a subtask of this week's train blockers [09:07:48] (03PS2) 10Vgutierrez: liberica: fix hcforwarder configuration [puppet] - 10https://gerrit.wikimedia.org/r/1087376 (https://phabricator.wikimedia.org/T377127) [09:08:27] thanks kostajh [09:08:57] any idea how be a good person to ask about this? 
[09:09:01] *would be [09:09:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087376 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:10:35] (03CR) 10Slyngshede: [C:03+2] P:firewall remove Icinga conntrack check [puppet] - 10https://gerrit.wikimedia.org/r/1085515 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [09:13:24] jnuche: maybe post in #wikimedia-sre with a link to the task? [09:13:52] I've added the `#SRE` tag to the task so it should get triaged quickly, but given the urgency seems like a good idea to message in IRC as well [09:14:17] sure [09:14:25] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:42] (03PS1) 10Slyngshede: P:firewall absent check_conntrack script. [puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) [09:19:23] (03PS1) 10Vgutierrez: liberica: Provide ipip0 and ipip60 devices [puppet] - 10https://gerrit.wikimedia.org/r/1087381 (https://phabricator.wikimedia.org/T377127) [09:20:26] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291760 (10Joe) Investigating right now. It would help to know on what server this happened. I assume the active deployment server? [09:21:38] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291762 (10jnuche) > I assume the active deployment server? 
Yeah, on `deploy2002` [09:21:49] <_joe_> !log restarted rsyslog on deploy2002 T379044 [09:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:52] T379044: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044 [09:22:45] !log jnuche@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 [09:22:48] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [09:23:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087381 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:23:51] jnuche: can I try to sync the config patch after you're done with train deployment? [09:24:32] kostajh: that's no problem, failure is still happening though [09:25:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4453/console" [puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [09:26:53] (03CR) 10Slyngshede: P:firewall absent check_conntrack script. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [09:29:44] (03CR) 10Vgutierrez: [C:03+2] liberica: fix hcforwarder configuration [puppet] - 10https://gerrit.wikimedia.org/r/1087376 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:31:18] (03CR) 10Vgutierrez: [C:03+2] liberica: Provide ipip0 and ipip60 devices [puppet] - 10https://gerrit.wikimedia.org/r/1087381 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:31:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [09:31:46] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [09:31:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [09:33:30] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [09:34:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [09:37:39] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291795 (10Joe) Rsyslog had some tls errors so i restarted it, but i doubted it could be the real culprit, given it is reached via udp... [09:41:06] 06SRE, 10SRE-Access-Requests: Requesting access to snapshot* with group snapshot-admins for ebernhardson - https://phabricator.wikimedia.org/T379025#10291797 (10MatthewVernon) 05Open→03Resolved a:03Volans AFAICT the requested change was done by @Volans already, so closing this ticket. 
[09:41:55] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o to idp1004 [dns] - 10https://gerrit.wikimedia.org/r/1087363 (owner: 10Muehlenhoff) [09:45:27] (03PS1) 10Alexandros Kosiaris: api|rest-gateway: Support sending request headers to upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) [09:45:47] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [09:45:49] (03PS1) 10MVernon: admin/data.yaml: set krb: present for jsn [puppet] - 10https://gerrit.wikimedia.org/r/1087389 (https://phabricator.wikimedia.org/T378786) [09:48:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087389 (https://phabricator.wikimedia.org/T378786) (owner: 10MVernon) [09:48:43] (03CR) 10MVernon: [C:03+2] admin/data.yaml: set krb: present for jsn [puppet] - 10https://gerrit.wikimedia.org/r/1087389 (https://phabricator.wikimedia.org/T378786) (owner: 10MVernon) [09:49:16] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [09:50:47] (03PS1) 10Muehlenhoff: Assign ganeti role for ganeti1041/ganeti1042 [puppet] - 10https://gerrit.wikimedia.org/r/1087393 (https://phabricator.wikimedia.org/T378921) [09:50:55] FIRING: MaxConntrack: Max conntrack at 93.3% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:52:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE, 13Patch-For-Review: Request Kerberos identity for jsn.sherman - https://phabricator.wikimedia.org/T378786#10291834 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Hi @jsn.sherman this is all done for you now. 
[09:53:50] (03PS1) 10Slyngshede: Revert "P:firewall remove Icinga conntrack check" [puppet] - 10https://gerrit.wikimedia.org/r/1087397 [09:55:28] (03CR) 10Muehlenhoff: [C:03+2] Assign ganeti role for ganeti1041/ganeti1042 [puppet] - 10https://gerrit.wikimedia.org/r/1087393 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [09:55:47] (03CR) 10CI reject: [V:04-1] Revert "P:firewall remove Icinga conntrack check" [puppet] - 10https://gerrit.wikimedia.org/r/1087397 (owner: 10Slyngshede) [09:56:37] (03PS2) 10Slyngshede: Revert "P:firewall remove Icinga conntrack check" [puppet] - 10https://gerrit.wikimedia.org/r/1087397 [09:57:55] (03CR) 10Daniel Kinzler: [C:03+1] "I'm not very familiar with helm charts, but the intent looks right to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [09:58:06] (03CR) 10David Caro: [C:03+1] "LGTM, if you prefer not re-enabling it for everyone it could use some parameter (though complicates things)" [puppet] - 10https://gerrit.wikimedia.org/r/1087397 (owner: 10Slyngshede) [09:59:22] (03CR) 10Slyngshede: "It's fine, it's just not a very active alert, so the inconvenience is minimal."
[puppet] - 10https://gerrit.wikimedia.org/r/1087397 (owner: 10Slyngshede) [09:59:39] (03CR) 10Slyngshede: [C:03+2] Revert "P:firewall remove Icinga conntrack check" [puppet] - 10https://gerrit.wikimedia.org/r/1087397 (owner: 10Slyngshede) [10:00:01] (03PS2) 10Alexandros Kosiaris: api|rest-gateway: Support sending request headers to upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) [10:00:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:00:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:00:55] RESOLVED: MaxConntrack: Max conntrack at 99.42% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [10:01:16] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291857 (10Joe) I would suggest, on the short term, to just run docker with `--network=host` and then check what log calls are being m... 
[10:03:45] (03Abandoned) 10Ayounsi: Add Netbox script to change a server's NIC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1012680 (https://phabricator.wikimedia.org/T360297) (owner: 10Ayounsi) [10:05:00] (03CR) 10Elukey: [V:03+1 C:03+2] profile::docker::report: use the internal registry endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1087205 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [10:05:08] (03CR) 10Elukey: [C:03+2] docker_registry_ha: reduce from 300 to 180 the nginx timeout [puppet] - 10https://gerrit.wikimedia.org/r/1087206 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [10:07:07] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10291863 (10jnuche) More details, successful backports were still happening yesterday, e.g.: https://sal.toolforge.org/log/6iIf-ZIBFFSC... [10:07:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bookworm [10:09:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [10:09:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [10:09:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1032 (T376905)', diff saved to https://phabricator.wikimedia.org/P70913 and previous config saved to /var/cache/conftool/dbconfig/20241105-100953-ladsgroup.json [10:11:32] !log set proxy timeouts of docker registry's nginx instances from 300s to 180s - T378618 [10:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1032 (T376905)', diff saved to https://phabricator.wikimedia.org/P70914 and previous config 
saved to /var/cache/conftool/dbconfig/20241105-101553-ladsgroup.json [10:19:44] (03PS1) 10Urbanecm: CirrusSearch: Disable updating weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087407 (https://phabricator.wikimedia.org/T378983) [10:21:01] (03PS1) 10Vgutierrez: liberica: Fix hcforwarder config (take two) [puppet] - 10https://gerrit.wikimedia.org/r/1087408 (https://phabricator.wikimedia.org/T377127) [10:21:49] (03PS4) 10Alexandros Kosiaris: Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) [10:22:22] (03CR) 10Btullis: [C:03+1] aqs1013 replaced by aqs1022 (hardware refresh) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: 10Eevans) [10:22:55] (03CR) 10DCausse: [C:03+1] "thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087407 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [10:23:20] thanks dcausse! [10:23:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087408 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [10:27:45] (03CR) 10Btullis: [C:03+1] "> Note that this requires an Airflow user with the same username as the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [10:28:15] (03CR) 10Hnowlan: [C:03+1] api|rest-gateway: Support sending request headers to upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [10:28:45] (03CR) 10Btullis: [C:03+1] "Great, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff) [10:29:07] (03CR) 10Brouberol: [C:03+2] "We absolutely could run it as a one-off CLI call, but we could also ensure these user/roles via a DAG. Whatever float our boat!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [10:29:10] (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1087186 (owner: 10Muehlenhoff) [10:30:22] (03CR) 10Fabfur: [C:03+1] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/1087408 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [10:30:32] (03CR) 10Btullis: [C:03+1] global_config: expose additional ports on hadoop masters/workers [puppet] - 10https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:31:01] (03CR) 10Vgutierrez: [C:03+2] liberica: Fix hcforwarder config (take two) [puppet] - 10https://gerrit.wikimedia.org/r/1087408 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [10:31:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1032', diff saved to https://phabricator.wikimedia.org/P70915 and previous config saved to /var/cache/conftool/dbconfig/20241105-103101-ladsgroup.json [10:31:37] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:31:47] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: expose additional ports on hadoop masters/workers [puppet] - 10https://gerrit.wikimedia.org/r/1087135 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:31:58] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: define external services entries for the hive metastore servers [puppet] - 10https://gerrit.wikimedia.org/r/1087136 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [10:32:22] (03CR) 10Marostegui: [C:04-1] mariadb: productionize db2236 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:34:42] (03PS3) 10Arnaudb: mariadb: productionize db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) [10:34:50] (03CR) 10Marostegui: "Grants were applied on the sections already?" [puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:35:02] (03CR) 10Arnaudb: "good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:35:35] (03CR) 10Marostegui: "let's do one at the time. Let's start with m5" [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:35:48] (03CR) 10Arnaudb: "nope!" [puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:35:54] (03CR) 10Marostegui: [C:03+1] mariadb: productionize db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:37:21] (03CR) 10Marostegui: [C:04-1] "Then that needs to be done before we commit this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:37:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10291974 (10Clement_Goubert) Thanks @Jhancock.wm ! [10:40:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:41:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:41:23] !log jnuche@deploy2002 Installing scap version "4.121.0" for 209 hosts [10:43:30] (03PS1) 10Muehlenhoff: Add a helper script to setup the Ganeti KVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 [10:43:58] (03PS2) 10Muehlenhoff: Add a helper script to setup the Ganeti LVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 [10:44:27] !log jnuche@deploy2002 install-world aborted: (no justification provided) (duration: 03m 09s) [10:44:40] (03CR) 10CI reject: [V:04-1] Add a helper script to setup the Ganeti LVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 (owner: 10Muehlenhoff) [10:46:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1032', diff saved to https://phabricator.wikimedia.org/P70916 and previous config saved to /var/cache/conftool/dbconfig/20241105-104608-ladsgroup.json [10:46:10] !log jnuche@deploy2002 Installing scap version "4.121.0" for 209 hosts [10:46:12] (03PS3) 10Muehlenhoff: Add a helper script to setup the Ganeti LVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 [10:47:04] (03CR) 10Peter Fischer: [C:03+1] CirrusSearch: Disable updating weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087407 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [10:48:25] jouncebot: nowandnext [10:48:26] For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T0900) 
[10:48:26] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1100) [10:50:38] (03PS2) 10Arnaudb: dbproxy: switch CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) [10:51:01] (03CR) 10Arnaudb: "done!" [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:52:13] (03CR) 10Marostegui: [C:04-1] "-1 as this is blocked on the grants" [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [10:52:14] urbanecm: I'm trying to solve a train blocker and would like to try to run the presync again, are you thinking of deploying something? [10:52:37] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:53:06] 06SRE, 07SRE-Unowned, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10292031 (10jijiki) >>! In T376400#10287901, @MatthewVernon wrote: > @jijiki can you expand on what you mean, please? This task is currently too broad... For the time being the task is delibe... 
[10:53:08] jnuche: was thinking of deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1087407, but i can wait for the nearest window, i don't want to interfere with the train :) [10:53:49] urbanecm: ty :) [10:55:46] running the train presync in the next few mins if I don't see any objections [10:56:28] (03PS20) 10Clément Goubert: Provide conftool data for mwcron and mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) [10:56:30] (03CR) 10Clément Goubert: Provide conftool data for mwcron and mwscript-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [10:58:38] (03CR) 10Fabfur: [C:03+2] hiera: fix haproxykafka workers number [puppet] - 10https://gerrit.wikimedia.org/r/1085465 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [10:59:51] (03Abandoned) 10Fabfur: hiera: fix haproxykafka workers number [puppet] - 10https://gerrit.wikimedia.org/r/1085465 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1100) [11:01:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1032 (T376905)', diff saved to https://phabricator.wikimedia.org/P70917 and previous config saved to /var/cache/conftool/dbconfig/20241105-110115-ladsgroup.json [11:01:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1029.eqiad.wmnet with reason: Maintenance [11:01:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1029.eqiad.wmnet with reason: Maintenance [11:01:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1029 (T376905)', diff saved to https://phabricator.wikimedia.org/P70918 and previous config saved to 
/var/cache/conftool/dbconfig/20241105-110139-ladsgroup.json [11:01:58] (03PS1) 10Vgutierrez: liberica: Set healthcheck type [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) [11:02:29] !log jnuche@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 [11:02:32] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [11:02:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:07:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1029 (T376905)', diff saved to https://phabricator.wikimedia.org/P70919 and previous config saved to /var/cache/conftool/dbconfig/20241105-110739-ladsgroup.json [11:09:23] (03PS1) 10Elukey: tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) [11:09:24] (03PS1) 10Elukey: Change port for kartotherian-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [11:09:26] (03PS1) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [11:10:00] (03CR) 10CI reject: [V:04-1] tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [11:10:16] (03PS1) 10Kosta Harlan: Revert^2 "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087424 [11:11:16] (03PS2) 10Kosta Harlan: Revert^2 "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087424 (https://phabricator.wikimedia.org/T378336) [11:12:40] urbanecm: I also 
have a patch to sync if scap is unbroken [11:17:00] 06SRE, 07SRE-Unowned, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10292157 (10jijiki) [11:17:21] 06SRE, 07SRE-Unowned, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10292158 (10jijiki) @MatthewVernon updated description [11:17:30] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:18:59] (03PS2) 10Vgutierrez: liberica: Set healthcheck type [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) [11:19:08] (03CR) 10Vgutierrez: liberica: Set healthcheck type (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:22:40] (03PS2) 10Elukey: tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) [11:22:40] (03PS2) 10Elukey: Change port for kartotherian-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [11:22:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1029', diff saved to https://phabricator.wikimedia.org/P70920 and previous config saved to /var/cache/conftool/dbconfig/20241105-112246-ladsgroup.json [11:23:26] (03CR) 10Vgutierrez: [C:03+2] liberica: Set healthcheck type [puppet] - 10https://gerrit.wikimedia.org/r/1087417 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:23:44] kostajh: not sure whether the message was for me, we're in mw infra window anyway, and i'm not doing any changes atm :) [11:24:52] urbanecm: it was intended for you, as I saw you were interested to sync a patch as well [11:25:45] (03CR) 10Effie Mouzeli: [C:03+1] Deprecate system::role for memcached/redis roles [puppet] - 
10https://gerrit.wikimedia.org/r/1083160 (owner: 10Muehlenhoff) [11:25:50] ah, gotcha. [11:27:04] (03CR) 10Hnowlan: [C:03+1] kubernetes: fix hostnames for eqiad refresh and expansion [puppet] - 10https://gerrit.wikimedia.org/r/1087216 (https://phabricator.wikimedia.org/T376185) (owner: 10Clément Goubert) [11:30:10] jnuche: did it work? [11:31:23] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10292204 (10jnuche) [11:31:40] kostajh: yeah, train presync is currently running :) [11:31:40] kostajh: urbanecm: We don't have much to do in that window today afaict, so if you want to go ahead and take it to finish backports once scap's good, I think it's all right [11:32:00] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10292200 (10jnuche) p:05Unbreak!→03Triage Train now got past the failing stage. Removing this as a blocker [11:32:15] +1 [11:32:22] ty! [11:32:34] (03CR) 10Clément Goubert: [C:03+2] kubernetes: fix hostnames for eqiad refresh and expansion [puppet] - 10https://gerrit.wikimedia.org/r/1087216 (https://phabricator.wikimedia.org/T376185) (owner: 10Clément Goubert) [11:34:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[05-12] - https://phabricator.wikimedia.org/T377021#10292215 (10Clement_Goubert) Hostnames fixed in tasks and in puppet, sorry about that. [11:34:46] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10292216 (10Clement_Goubert) Hostnames fixed in tasks and in puppet, sorry about that. 
[11:37:15] (03CR) 10Hnowlan: [C:03+1] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [11:37:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1029', diff saved to https://phabricator.wikimedia.org/P70921 and previous config saved to /var/cache/conftool/dbconfig/20241105-113754-ladsgroup.json [11:38:57] !log jnuche@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 (duration: 36m 28s) [11:39:05] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [11:39:57] presync done, I'm gonna promote to group0 and then we can do the backports [11:40:12] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4454/c" [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [11:41:15] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087433 (https://phabricator.wikimedia.org/T375661) [11:41:16] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087433 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [11:41:44] (03CR) 10Hnowlan: [V:03+1 C:03+1] tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [11:41:50] (03CR) 10Hnowlan: [C:03+1] Change port for kartotherian-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [11:42:01] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087433 (https://phabricator.wikimedia.org/T375661) 
(owner: 10TrainBranchBot) [11:42:13] (03CR) 10Hnowlan: [C:03+1] profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [11:46:15] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1040 [11:47:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1040 [11:47:51] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1041 [11:49:03] (03PS1) 10Slyngshede: Netfilter: Route alerts for cloud hosts to WMCS. [alerts] - 10https://gerrit.wikimedia.org/r/1087434 [11:49:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1041 [11:52:11] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1042 [11:53:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1029 (T376905)', diff saved to https://phabricator.wikimedia.org/P70922 and previous config saved to /var/cache/conftool/dbconfig/20241105-115301-ladsgroup.json [11:53:18] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.2 refs T375661 [11:53:21] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [11:53:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1042 [11:55:57] checking everything looks healthy [11:57:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [11:58:14] well damn, I need to rollback... 
[11:58:26] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087435 [12:01:13] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087438 (https://phabricator.wikimedia.org/T375661) [12:01:14] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087438 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [12:01:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [12:02:00] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087438 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [12:02:27] !log jnuche@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 [12:02:29] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [12:02:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1040.eqiad.wmnet to cluster eqiad and group B [12:04:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1040.eqiad.wmnet to cluster eqiad and group B [12:10:10] !log jnuche@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.2 refs T375661 (duration: 07m 43s) [12:10:18] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [12:11:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10292360 (10MoritzMuehlenhoff) [12:11:34] 06SRE, 06Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411#10292358 (10cmooney) While thinking about T378346 it occurs to me there may be a half-way approach to all this which might work. 
In brief: * We provision hosts... [12:13:16] jnuche: i see the rollback finished, does that mean things are done (for now), please? [12:14:25] RESOLVED: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:34] kostajh, urbanecm: yeah, sorry, I was checking errors went back to normal after the rollback. You can go ahead with your backports from my side [12:14:42] thanks! [12:14:53] (03CR) 10Urbanecm: [C:03+2] CirrusSearch: Disable updating weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087407 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [12:15:33] (03Merged) 10jenkins-bot: CirrusSearch: Disable updating weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087407 (https://phabricator.wikimedia.org/T378983) (owner: 10Urbanecm) [12:16:09] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087407|CirrusSearch: Disable updating weighted tags via EventBus (T378983 T377150)]] [12:16:13] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [12:16:13] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [12:16:22] urbanecm: lmk when you're done please [12:16:32] kostajh: will do! 
[12:17:37] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:17:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2230.codfw.wmnet with reason: testing [12:18:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2230.codfw.wmnet with reason: testing [12:18:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: testing [12:18:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: testing [12:19:49] (03CR) 10Tchanders: [C:03+1] Revert^2 "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087424 (https://phabricator.wikimedia.org/T378336) (owner: 10Kosta Harlan) [12:22:47] (03PS1) 10Kosta Harlan: maintain-views.yml: Fix globalblocks filtering [puppet] - 10https://gerrit.wikimedia.org/r/1087443 (https://phabricator.wikimedia.org/T378994) [12:23:49] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087407|CirrusSearch: Disable updating weighted tags via EventBus (T378983 T377150)]] (duration: 07m 39s) [12:23:53] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [12:23:54] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [12:27:46] (03CR) 10FNegri: [C:03+1] "Adding Ben and Amir to reviewers, but I think this one can be merged straight away as it's just a simple fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087443 (https://phabricator.wikimedia.org/T378994) (owner: 10Kosta Harlan) [12:28:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet [12:28:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10292394 (10ops-monitoring-bot) Draining ganeti1013.eqiad.wmnet of running VMs [12:30:13] kostajh: deployment done [12:30:31] Thanks [12:32:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet [12:32:18] (03CR) 10Marostegui: [C:03+1] maintain-views.yml: Fix globalblocks filtering [puppet] - 10https://gerrit.wikimedia.org/r/1087443 (https://phabricator.wikimedia.org/T378994) (owner: 10Kosta Harlan) [12:32:51] (03CR) 10FNegri: [C:03+2] maintain-views.yml: Fix globalblocks filtering [puppet] - 10https://gerrit.wikimedia.org/r/1087443 (https://phabricator.wikimedia.org/T378994) (owner: 10Kosta Harlan) [12:33:02] !log mwmaint2002: kill all instances of refreshLinkRecommendation (T378983) [12:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:05] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [12:33:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet [12:33:27] !log eswiki,x1: `delete from growthexperiments_link_recommendations where gelr_page=10598298;` (to verify updates are flowing in; T378983) [12:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:14] voila, update is here! 
[12:34:25] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:34:25] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=93) [12:35:00] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:35:01] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=93) [12:35:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10292423 (10ops-monitoring-bot) Draining ganeti1013.eqiad.wmnet of running VMs [12:35:38] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:36:25] FIRING: [6x] SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:46] jouncebot: nowandnext [12:36:47] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [12:36:47] In 0 hour(s) and 23 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1300) [12:36:58] ok, I'm going ahead with the temp accounts patch [12:38:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087424 (https://phabricator.wikimedia.org/T378336) (owner: 10Kosta Harlan) [12:38:46] (03Merged) 10jenkins-bot: Revert^2 "temp accounts: Enable temp account creation on second-round pilots" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087424 (https://phabricator.wikimedia.org/T378336) (owner: 10Kosta Harlan) [12:39:11] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1087424|Revert^2 "temp accounts: Enable temp account creation on second-round pilots" (T378336)]] [12:39:21] 
T378336: Temporary Accounts: Minor pilots - Nov 5 deploy - https://phabricator.wikimedia.org/T378336 [12:40:26] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [12:42:01] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1087424|Revert^2 "temp accounts: Enable temp account creation on second-round pilots" (T378336)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:46:13] !log kharlan@deploy2002 kharlan: Continuing with sync [12:49:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [12:50:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10292432 (10Marostegui) es2022 → will be replaced by es2043, server can be installed in C6: There is already one es in C6, better to look for a differen... [12:50:58] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087424|Revert^2 "temp accounts: Enable temp account creation on second-round pilots" (T378336)]] (duration: 11m 46s) [12:51:01] T378336: Temporary Accounts: Minor pilots - Nov 5 deploy - https://phabricator.wikimedia.org/T378336 [12:56:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [12:57:40] I'm finished [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1300) [13:03:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [13:03:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10292455 (10Clement_Goubert) 05In progress→03Resolved RAID is now rebuilt. 
[13:05:40] (03CR) 10Cathal Mooney: [C:03+2] Only try to find 'real' netmask for IPs if they are /32 or /128 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1085590 (https://phabricator.wikimedia.org/T378751) (owner: 10Cathal Mooney) [13:07:49] (03Merged) 10jenkins-bot: Only try to find 'real' netmask for IPs if they are /32 or /128 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1085590 (https://phabricator.wikimedia.org/T378751) (owner: 10Cathal Mooney) [13:08:42] !log installing php7.4 security updates on remaining non-wikikube servers T378173 [13:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:17] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:09:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:09:58] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:10:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [13:10:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:11:06] (03PS2) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [13:11:56] (03CR) 10Elukey: "Hey Valentin, adding you to verify that I am not doing something horrible on the nginx side. 
This config is almost deprecated and used by " [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [13:16:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [13:16:44] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [13:16:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [13:24:34] (03PS1) 10Brouberol: airflow: export a PYTHONPATH env var reflecting the new bullseye based image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087454 (https://phabricator.wikimedia.org/T377928) [13:24:48] (03PS1) 10Gergő Tisza: JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) [13:26:27] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks to both!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:27:16] (03CR) 10Btullis: [C:03+1] airflow: export a PYTHONPATH env var reflecting the new bullseye based image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087454 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [13:27:33] (03Merged) 10jenkins-bot: api|rest-gateway: Support sending request headers to upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087388 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:28:22] (03CR) 10Brouberol: [C:03+2] airflow: export a PYTHONPATH env var reflecting the new bullseye based image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087454 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [13:28:26] (03PS1) 10Arnaudb: dotfiles: add ssh config for mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1087457 (https://phabricator.wikimedia.org/T378068) [13:28:28] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10292537 (10Petrb) 05Open→03Resolved a:03Petrb Hello, new version of Huggle was released that contains a patched libirc http... 
[13:29:09] (03CR) 10Dreamy Jazz: [C:03+1] JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) (owner: 10Gergő Tisza) [13:29:38] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:31:25] RESOLVED: [6x] SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:58] (03CR) 10Bvibber: [C:03+1] "Does exactly what I'd have done for now. :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) (owner: 10Gergő Tisza) [13:33:33] 10ops-eqiad, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10292551 (10Marostegui) 05Resolved→03Open @papaul @Jclark-ctr @VRiley-WMF I have reopened this ticket because I do think this is a HW issue and it is in fact a recurr... [13:33:49] 10ops-eqiad, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10292557 (10Marostegui) [13:34:28] !log imported jenkins 2.479.1 to thirdparty/ci for bullseye-wikimedia T379059 [13:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:30] ^ hashar [13:34:30] T379059: Upgrade Jenkins instances to 2.479.1 - https://phabricator.wikimedia.org/T379059 [13:35:11] moritzm: Thank you very much! 
[13:36:25] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072 (10cmooney) 03NEW p:05Triage→03Low [13:36:39] 06SRE, 06Infrastructure-Foundations, 10netops: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216#10292564 (10ayounsi) I added some logging (`self.log_info(f"{interface} {interface.enabled} {interface.untagged_vlan} {interface.tagge... [13:39:47] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:41:40] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:42:29] (03PS1) 10Brouberol: airflow: create the kerberos token PVC even if kerberos is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087459 (https://phabricator.wikimedia.org/T375875) [13:42:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:42:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:44:41] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:44:51] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:49:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) (owner: 10Gergő Tisza) [13:52:19] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:52:28] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:53:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [13:54:05] 
!log reimage pc1017 T378068 [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:08] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [13:55:56] (03PS1) 10Alexandros Kosiaris: Fixup for rest-gateway's request_headers_to_add [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087462 (https://phabricator.wikimedia.org/T374683) [13:57:13] !log installed libapache2-mod-auth-openidc bugfix updates from Bookworm point release [13:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10292648 (10MoritzMuehlenhoff) [13:57:42] (03CR) 10Alexandros Kosiaris: [C:03+2] Fixup for rest-gateway's request_headers_to_add [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087462 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:58:43] 06SRE, 06Infrastructure-Foundations, 10netops: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216#10292651 (10ayounsi) Another point, after running the script, the changelog on a problematic interface shows 3 changes (for that inter... [13:58:49] (03Merged) 10jenkins-bot: Fixup for rest-gateway's request_headers_to_add [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087462 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1400). [14:00:05] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:24] o/ [14:01:25] I can deploy, if someone can test the fix for T379067 [14:01:25] T379067: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwikidatawiki.globaljsonlinks' doesn't existFunction: JsonConfig\GlobalJsonLinks::getLinksFromPageQuery: SELECT gjlt_namespace,gjlt_title FROM `globaljsonlinks` JOIN `glob - https://phabricator.wikimedia.org/T379067 [14:03:27] so far I haven’t been able to reproduce it, which is annoying [14:05:31] ok, evidently I managed to trigger it, it’s just not user-visible [14:05:40] o/ [14:05:42] Lucas_WMDE: according to the logs, it seems to appear in deferred update [14:05:48] because post output deferred updates fun [14:05:49] yup [14:05:57] tgr|away: want to self-serve or should I deploy? [14:06:51] seems to be reproducible pretty reliably, just make a page edit [14:06:56] (e.g. on https://test.wikidata.org/wiki/User:Lucas_Werkmeister_(WMDE)/sandbox) [14:07:11] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:07:20] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:07:40] Lucas_WMDE: thx, I can deploy it [14:07:48] ok! 
[14:07:52] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:08:00] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:08:30] !log installing PHP 7.4 security updates on bullseye (as packaged in Debian) [14:08:30] (reproducible on mwdebug too, yay) [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [14:10:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) (owner: 10Gergő Tisza) [14:10:36] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:10:50] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:11:00] (03Merged) 10jenkins-bot: JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087455 (https://phabricator.wikimedia.org/T379067) (owner: 10Gergő Tisza) [14:11:02] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1087230 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:11:28] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1087455|JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors (T379067)]] [14:11:37] T379067: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwikidatawiki.globaljsonlinks' doesn't existFunction: JsonConfig\GlobalJsonLinks::getLinksFromPageQuery: SELECT gjlt_namespace,gjlt_title FROM `globaljsonlinks` JOIN `glob - https://phabricator.wikimedia.org/T379067 [14:12:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [14:12:43] (03CR) 10Marostegui: [C:03+1] dotfiles: add ssh config for mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1087457 (https://phabricator.wikimedia.org/T378068) (owner: 10Arnaudb) [14:12:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10292711 (10MoritzMuehlenhoff) [14:12:56] (03CR) 10Arnaudb: [C:03+2] dotfiles: add ssh config for mgmt [puppet] - 10https://gerrit.wikimedia.org/r/1087457 (https://phabricator.wikimedia.org/T378068) (owner: 10Arnaudb) [14:15:01] well this is new [14:15:09] 14:13:47 Check 'check_testservers_k8s-1_of_1' failed: Sending to mwdebug.discovery.wmnet... [14:15:17] Location header: expected 'http://it.wikipedia.org/wiki/Saturno_(astronomia)?a=test', was missing. [14:15:41] FAIL: 131 requests sent to mwdebug.discovery.wmnet. 1 request with failed assertions. [14:15:59] tgr|away: i'd say rerun the checks [14:16:08] !log tgr@deploy2002 tgr: Backport for [[gerrit:1087455|JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors (T379067)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:16:11] i saw flaky checks like this before [14:16:11] yeah, it happens sometimes for me (maybe once every few months) [14:16:18] ^^ [14:16:35] did it say that the HTTP status was 500, or something else? [14:16:54] 503 [14:17:25] very likely transient then. 
might be worth filing a "this happens every now and then" task if it doesn't exist yet [14:21:19] only task I’m aware of is T364886 [14:21:20] T364886: httpbb should show more information / details about failed checks - https://phabricator.wikimedia.org/T364886 [14:22:06] (03CR) 10Vgutierrez: "sure, no problem 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [14:22:26] (03PS1) 10Arnaudb: mariadb: wipe /srv on pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1087468 (https://phabricator.wikimedia.org/T378068) [14:24:05] !log tgr@deploy2002 tgr: Continuing with sync [14:28:52] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087455|JsonConfig: Disable TrackGlobalJsonLinks to avoid missing table errors (T379067)]] (duration: 17m 24s) [14:28:55] T379067: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwikidatawiki.globaljsonlinks' doesn't existFunction: JsonConfig\GlobalJsonLinks::getLinksFromPageQuery: SELECT gjlt_namespace,gjlt_title FROM `globaljsonlinks` JOIN `glob - https://phabricator.wikimedia.org/T379067 [14:29:11] !log upload liberica 0.3 to apt.wm.o (bookworm-wikimedia) [14:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [14:29:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [14:30:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1026 (T376905)', diff saved to https://phabricator.wikimedia.org/P70926 and previous config saved to /var/cache/conftool/dbconfig/20241105-142959-ladsgroup.json [14:30:03] (03PS3) 10Elukey: tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) [14:30:03] 
(03PS3) 10Elukey: Change port for kartotherian-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [14:30:03] (03PS3) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [14:30:19] (03CR) 10Elukey: tlsproxy::localssl: allow multiple listens for tls ports (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [14:31:48] nothing else to deploy, I think? [14:32:37] (03CR) 10Marostegui: [C:03+1] mariadb: wipe /srv on pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1087468 (https://phabricator.wikimedia.org/T378068) (owner: 10Arnaudb) [14:32:48] (03CR) 10Arnaudb: [C:03+2] mariadb: wipe /srv on pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1087468 (https://phabricator.wikimedia.org/T378068) (owner: 10Arnaudb) [14:32:48] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4455/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [14:32:49] !log UTC afternoon deploys done [14:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [14:34:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [14:35:18] (03CR) 10Vgutierrez: [C:03+1] tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [14:35:40] (03PS1) 10Slyngshede: Account Managers: Allow account managers to be assigned by LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1087474 [14:35:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T376905)', diff saved to https://phabricator.wikimedia.org/P70927 and previous config saved to /var/cache/conftool/dbconfig/20241105-143552-ladsgroup.json [14:37:37] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:56] I'm gonna use the remainder of the window to roll forward the train in a couple mins [14:38:09] (03CR) 10Muehlenhoff: [C:03+2] spark: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087186 (owner: 10Muehlenhoff) [14:38:13] (03PS3) 10CDanis: Create releases for chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [14:38:27] ok! 
[14:39:26] :) [14:39:50] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087478 (https://phabricator.wikimedia.org/T375661) [14:39:52] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087478 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [14:40:02] (03PS1) 10Muehlenhoff: Revert "spark: Avoid Ferm-specific syntax" [puppet] - 10https://gerrit.wikimedia.org/r/1087479 [14:40:36] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087478 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [14:42:17] (03CR) 10Muehlenhoff: [C:03+2] Revert "spark: Avoid Ferm-specific syntax" [puppet] - 10https://gerrit.wikimedia.org/r/1087479 (owner: 10Muehlenhoff) [14:42:30] (03CR) 10CDanis: [C:03+2] Create releases for chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [14:42:47] (03PS4) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [14:42:47] (03PS4) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [14:43:28] (03Merged) 10jenkins-bot: Create releases for chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [14:44:16] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [14:44:38] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [14:47:18] (03PS4) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) 
[14:48:06] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.2 refs T375661 [14:48:09] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [14:48:26] (03PS1) 10Slyngshede: P:idm Assign permission to TarLogic. [puppet] - 10https://gerrit.wikimedia.org/r/1087487 (https://phabricator.wikimedia.org/T352144) [14:48:45] (03PS1) 10Muehlenhoff: spark: Avoid Ferm-specific syntax (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1087488 [14:49:05] (03CR) 10FNegri: [C:03+2] alertmanager: fix WMCS template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [14:50:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10292809 (10Gehel) a:05bking→03None [14:50:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [14:51:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P70928 and previous config saved to /var/cache/conftool/dbconfig/20241105-145059-ladsgroup.json [14:52:03] jnuche: if the train looks good, I will update the CI Jenkins [14:53:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [14:54:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087488 (owner: 10Muehlenhoff) [14:54:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087487 (https://phabricator.wikimedia.org/T352144) (owner: 10Slyngshede) [14:55:09] hashar: there's another new issue which is probably another blocker, but it affects only one wiki for now so I'm not rolling back [14:55:14] you can update the CI 
jenkins :) [14:55:46] great! [14:59:56] (03PS1) 10FNegri: Update .pint.hcl syntax for pint >=0.64 [alerts] - 10https://gerrit.wikimedia.org/r/1087490 [15:01:08] (03CR) 10CI reject: [V:04-1] Update .pint.hcl syntax for pint >=0.64 [alerts] - 10https://gerrit.wikimedia.org/r/1087490 (owner: 10FNegri) [15:01:08] !log Upgrading CI Jenkins | T379059 [15:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:12] T379059: Upgrade Jenkins instances to 2.479.1 - https://phabricator.wikimedia.org/T379059 [15:01:58] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [15:02:10] oh nice [15:02:19] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [15:02:23] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [15:02:36] the agents can't connect over ssh [15:02:37] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:56] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [15:05:12] (03CR) 10Slyngshede: [C:03+2] P:idm Assign permission to TarLogic. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087487 (https://phabricator.wikimedia.org/T352144) (owner: 10Slyngshede) [15:06:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P70929 and previous config saved to /var/cache/conftool/dbconfig/20241105-150607-ladsgroup.json [15:06:13] 06SRE, 06Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411#10292867 (10cmooney) I spoke with @ayounsi on irc about this approach and he expressed some concerns with the use of a cookbook to manage the host's network config... [15:12:25] (03PS2) 10Muehlenhoff: Remove spark2 profile [puppet] - 10https://gerrit.wikimedia.org/r/1087187 [15:12:37] I gotta roll back [15:12:50] short of installing Java 17 on the agents :D [15:13:47] what I don't get is that I was sure I had all the agents upgraded bah [15:14:30] better now than when there's the next Jenkins security release (and then we can only move to the current LTS) [15:14:36] (03PS5) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [15:14:37] (03PS5) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [15:15:19] and of course I can't log in to Horizon [15:15:19] bah [15:15:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [15:16:18] https://openstack.eqiad1.wikimediacloud.org:25000/protected goes in a loop [15:16:28] ah I got in eventually [15:18:09] !log Switched WMCS integration instances from Java 11 to Java 17 via Horizon project wide config. 
That was forgotten in T359795 and blocks today's Jenkins upgrade ( T379059 ) [15:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:20] T359795: Switch Jenkins instances from Java 11 to Java 17 - https://phabricator.wikimedia.org/T359795 [15:18:20] T379059: Upgrade Jenkins instances to 2.479.1 - https://phabricator.wikimedia.org/T379059 [15:18:51] https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1044/log [15:18:53] \o/ [15:19:12] moritzm: yup, thank you for the positive mood! [15:19:56] I don't know why I forgot the WMCS instances [15:20:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [15:20:42] (03CR) 10Vgutierrez: [C:03+1] varnish: Move wm_recv_early subroutine to inline [puppet] - 10https://gerrit.wikimedia.org/r/1083913 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:21:06] (03CR) 10Vgutierrez: [C:03+1] varnish: Move wm_recv_purge subroutine to inline [puppet] - 10https://gerrit.wikimedia.org/r/1083914 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:21:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T376905)', diff saved to https://phabricator.wikimedia.org/P70931 and previous config saved to /var/cache/conftool/dbconfig/20241105-152114-ladsgroup.json [15:21:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [15:21:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [15:21:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1033 (T376905)', diff saved to https://phabricator.wikimedia.org/P70932 and previous config saved to /var/cache/conftool/dbconfig/20241105-152139-ladsgroup.json [15:27:54] !log Switched deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud to Java 
17 # T359795 [15:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:57] T359795: Switch Jenkins instances from Java 11 to Java 17 - https://phabricator.wikimedia.org/T359795 [15:28:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1033 (T376905)', diff saved to https://phabricator.wikimedia.org/P70933 and previous config saved to /var/cache/conftool/dbconfig/20241105-152819-ladsgroup.json [15:29:20] (03CR) 10BCornwall: [C:03+2] varnish: Move wm_recv_purge subroutine to inline [puppet] - 10https://gerrit.wikimedia.org/r/1083914 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:29:20] (03CR) 10BCornwall: [C:03+2] varnish: Move wm_recv_early subroutine to inline [puppet] - 10https://gerrit.wikimedia.org/r/1083913 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:29:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Request Kerberos identity for jsn.sherman - https://phabricator.wikimedia.org/T378786#10292979 (10jsn.sherman) >>! In T378786#10291834, @MatthewVernon wrote: > Hi @jsn.sherman this is all done for you now. Thanks @MatthewVernon! I w... 
[15:32:36] !log Switched PCC workers to Java 17 via https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-pcc-worker # T359795 [15:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:38] (03PS1) 10Arnaudb: Revert "mariadb: wipe /srv on pc1017" [puppet] - 10https://gerrit.wikimedia.org/r/1087499 [15:37:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [15:37:28] (03PS1) 10Brouberol: global_config: register the Yarn resourcemanager IPC port in the hadoop service [puppet] - 10https://gerrit.wikimedia.org/r/1087500 (https://phabricator.wikimedia.org/T377602) [15:38:05] (03CR) 10CI reject: [V:04-1] global_config: register the Yarn resourcemanager IPC port in the hadoop service [puppet] - 10https://gerrit.wikimedia.org/r/1087500 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [15:40:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [15:41:18] jouncebot: nowandnext [15:41:19] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [15:41:19] In 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1600) [15:41:40] (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1087494 will probably want backporting soonish, though it should go through gate-and-submit on master first) [15:41:56] I have successfully upgraded the CI Jenkins [15:42:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff) [15:42:40] ah, I thought things looked slightly different :) [15:42:47] yay for upgrades! [15:42:55] jnuche: I have upgraded the CI Jenkins successfully! 
The releases one can wait next week, there is no security update in that LTS version [15:43:14] the changelog at https://www.jenkins.io/changelog-stable/ lists a bunch of UI improvements [15:43:19] Lucas_WMDE: Can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1083146 first, so I can use your backport to see if I didn't break anything existing? [15:43:25] Enhancements and refinements for the appearance of several pages in Jenkins. pull 9521, pull 9707, pull 9461, pull 9411, pull 9393, pull 9381 [15:43:25] Refinements and modernizations to sections of the Jenkins UI. pull 9453, pull 9380, pull 9365, pull 9395, pull 9641 [15:43:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1033', diff saved to https://phabricator.wikimedia.org/P70934 and previous config saved to /var/cache/conftool/dbconfig/20241105-154326-ladsgroup.json [15:43:44] claime: as far as I’m concerned, sure [15:43:50] (03CR) 10Scott French: [C:03+1] Provide conftool data for mwcron and mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:43:50] not sure it’s “my” backport to begin with ^^ [15:43:56] we’ll see [15:44:08] s/your/that/ :p [15:44:26] :P [15:44:32] (03CR) 10Clément Goubert: [C:03+2] Provide conftool data for mwcron and mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1083146 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:45:53] (03PS2) 10Brouberol: global_config: register the Yarn resourcemanager IPC port in the hadoop service [puppet] - 10https://gerrit.wikimedia.org/r/1087500 (https://phabricator.wikimedia.org/T377602) [15:46:14] hashar: ack! 
[15:47:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet [15:48:09] !log remove ganeti1013 from active ganeti nodes T378921 [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:24] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [15:50:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1041.eqiad.wmnet to cluster eqiad and group B [15:50:34] PROBLEM - ganeti-confd running on ganeti1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:51:00] PROBLEM - ganeti-noded running on ganeti1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:51:29] (03CR) 10Scott French: [C:03+2] trafficserver: Lua script for routing 8.1-enrolled traffic [puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [15:51:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1041.eqiad.wmnet to cluster eqiad and group B [15:51:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1042.eqiad.wmnet to cluster eqiad and group B [15:51:57] (03CR) 10Btullis: [C:03+1] global_config: register the Yarn resourcemanager IPC port in the hadoop service [puppet] - 10https://gerrit.wikimedia.org/r/1087500 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [15:52:37] FIRING: ProbeDown: Service ganeti1013:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:11] !log jmm@cumin2002 
END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1042.eqiad.wmnet to cluster eqiad and group B [15:53:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [15:53:46] (03CR) 10Eevans: [C:03+2] aqs1013 replaced by aqs1022 (hardware refresh) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087240 (https://phabricator.wikimedia.org/T379026) (owner: 10Eevans) [15:54:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [15:54:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1014.eqiad.wmnet [15:54:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [15:54:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10293116 (10ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs [15:55:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10293120 (10MoritzMuehlenhoff) [15:56:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10293134 (10MoritzMuehlenhoff) [15:56:38] (03PS1) 10JHathaway: ms-be: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) [15:58:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too 
high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [15:58:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1033', diff saved to https://phabricator.wikimedia.org/P70935 and previous config saved to /var/cache/conftool/dbconfig/20241105-155833-ladsgroup.json [15:58:50] Lucas_WMDE: Done, should be all good [15:59:12] (03PS1) 10Lucas Werkmeister (WMDE): Fixup paths to moved resources [extensions/ConfirmEdit] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087507 (https://phabricator.wikimedia.org/T379080) [15:59:19] uploaded the backport ^ [15:59:33] Reedy: do you want to deploy or should I? [16:00:05] eoghan, jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1600). nyaa~ [16:00:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [16:00:10] feel free if you're bored :P [16:00:15] ^^ [16:00:16] But I can do it if you'd rather [16:00:45] I think I should file a bug about that... It feels like something that should be tested by CI [16:00:57] I don’t mind either way [16:01:09] (T290932 is a more general “make remoteExtPath better” task, I suppose) [16:01:10] T290932: Figure out remoteExtPath/remoteBasePath automatically for the common case - https://phabricator.wikimedia.org/T290932 [16:01:11] (03CR) 10Elukey: "Left a nit, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:01:18] (03CR) 10Elukey: [C:03+1] ms-be: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:01:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [16:02:55] (03PS2) 10JHathaway: ms-be: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) [16:02:57] eoghan, jelto, arnoldokoth, mutante: there’s a train blocker fix (T379080) that would be nice to backport soon, is it okay to do it during your window? [16:02:58] T379080: RuntimeException: package file not found or not a file: "/srv/mediawiki/php-1.44.0-wmf.2/extensions/ConfirmEdit/FancyCaptcha/resources/ext.confirmEdit.fancyCaptcha.js" - https://phabricator.wikimedia.org/T379080 [16:03:05] (03CR) 10JHathaway: ms-be: partman EFI recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:03:35] (03CR) 10Elukey: "More info: https://wikitech.wikimedia.org/wiki/UEFI_Boot" [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:04:15] (03CR) 10JHathaway: ms-be: partman EFI recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:05:32] (03CR) 10Ladsgroup: [C:03+2] Fixup paths to moved resources [extensions/ConfirmEdit] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087507 (https://phabricator.wikimedia.org/T379080) (owner: 10Lucas Werkmeister (WMDE)) [16:05:51] Lucas_WMDE: go for it! 
[16:06:23] ok ^^ [16:06:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087507 (https://phabricator.wikimedia.org/T379080) (owner: 10Lucas Werkmeister (WMDE)) [16:06:39] if anyone thinks I shouldn’t deploy, you still have some time while gate-and-submit runs [16:07:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:07:30] all good from the conductor side (and thank you!) [16:07:46] (03CR) 10Muehlenhoff: [C:03+2] Remove spark2 profile [puppet] - 10https://gerrit.wikimedia.org/r/1087187 (owner: 10Muehlenhoff) [16:09:00] * Lucas_WMDE checks if the issue is reproducible [16:09:29] (03CR) 10Brouberol: [C:03+2] global_config: register the Yarn resourcemanager IPC port in the hadoop service [puppet] - 10https://gerrit.wikimedia.org/r/1087500 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [16:09:46] In theory, some missing JS/less/css... whether it's noticeable from the UI.. 
:P [16:10:11] https://www.mediawiki.org/w/load.php?lang=en&modules=ext.confirmEdit.fancyCaptcha is good enough for me, that shows the error ^^ [16:10:15] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [16:12:34] (https://www.mediawiki.org/wiki/Special:AbuseFilter/?rulescope=global&deletedfilters=hide&limit=500 looks like all the showcaptcha abuse filters are private so I wouldn’t know how to trigger them anyway) [16:12:45] (and I’m happy to stay ignorant there) [16:13:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1033 (T376905)', diff saved to https://phabricator.wikimedia.org/P70936 and previous config saved to /var/cache/conftool/dbconfig/20241105-161340-ladsgroup.json [16:14:23] (03PS1) 10Muehlenhoff: dns::auth::update: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087508 [16:14:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance [16:14:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance [16:14:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1031 (T376905)', diff saved to https://phabricator.wikimedia.org/P70937 and previous config saved to /var/cache/conftool/dbconfig/20241105-161455-ladsgroup.json [16:17:25] (03PS1) 10Dzahn: admin: add a yubikey SSH key to user dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1087509 [16:18:57] RESOLVED: ProbeDown: Service ganeti1013:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1031 
(T376905)', diff saved to https://phabricator.wikimedia.org/P70938 and previous config saved to /var/cache/conftool/dbconfig/20241105-162048-ladsgroup.json [16:22:47] (03PS1) 10CDanis: Add service chart-renderer to k8s ingress as a/a [dns] - 10https://gerrit.wikimedia.org/r/1087511 (https://phabricator.wikimedia.org/T372081) [16:27:07] (03CR) 10Alexandros Kosiaris: [C:03+1] Add service chart-renderer to k8s ingress as a/a [dns] - 10https://gerrit.wikimedia.org/r/1087511 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [16:28:23] (03CR) 10Kamila Součková: [C:03+1] Add service chart-renderer to k8s ingress as a/a [dns] - 10https://gerrit.wikimedia.org/r/1087511 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [16:28:39] (03Merged) 10jenkins-bot: Fixup paths to moved resources [extensions/ConfirmEdit] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087507 (https://phabricator.wikimedia.org/T379080) (owner: 10Lucas Werkmeister (WMDE)) [16:29:13] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1087507|Fixup paths to moved resources (T379080)]] [16:29:26] (03CR) 10CDanis: [C:03+2] Add service chart-renderer to k8s ingress as a/a [dns] - 10https://gerrit.wikimedia.org/r/1087511 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [16:29:26] T379080: RuntimeException: package file not found or not a file: "/srv/mediawiki/php-1.44.0-wmf.2/extensions/ConfirmEdit/FancyCaptcha/resources/ext.confirmEdit.fancyCaptcha.js" - https://phabricator.wikimedia.org/T379080 [16:32:04] !log cdanis@cumin1002 START - Cookbook sre.dns.netbox [16:32:13] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1087507|Fixup paths to moved resources (T379080)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:32:27] https://www.mediawiki.org/w/load.php?lang=en&modules=ext.confirmEdit.fancyCaptcha works with mwdebug \o/ [16:32:29] !log 
lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [16:34:20] !log cdanis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:46] (03PS1) 10Eevans: remove outdated (and incorrect) host entries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 [16:35:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P70939 and previous config saved to /var/cache/conftool/dbconfig/20241105-163556-ladsgroup.json [16:37:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087507|Fixup paths to moved resources (T379080)]] (duration: 08m 02s) [16:37:18] T379080: RuntimeException: package file not found or not a file: "/srv/mediawiki/php-1.44.0-wmf.2/extensions/ConfirmEdit/FancyCaptcha/resources/ext.confirmEdit.fancyCaptcha.js" - https://phabricator.wikimedia.org/T379080 [16:38:21] * Lucas_WMDE done deploying [16:38:45] (03CR) 10Eevans: "I'm not completely certain of the rationale provided in the commit message, but I'm employing [Cunningham's Law](https://meta.wikimedia.or" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [16:39:20] (03PS1) 10CDanis: service catalog: add chart-renderer in service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1087515 (https://phabricator.wikimedia.org/T372081) [16:40:54] (03CR) 10Kamila Součková: [C:03+1] service catalog: add chart-renderer in service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1087515 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [16:43:00] (03PS1) 10Vgutierrez: prometheus::ops: Scrape liberica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1087517 (https://phabricator.wikimedia.org/T377127) [16:43:45] (03CR) 10Hnowlan: "I kinda suspect the config rendered by an empty array here will just break the service when used in a test setting. 
The rationale for this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [16:46:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087517 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [16:50:54] (03CR) 10Scott French: "Agreed, yeah - this is roughly what I suspected when this question came up on I5bae756dfa8ed3d0e22d2281d4c45e68a6236be7 (e.g., possibly us" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [16:51:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P70940 and previous config saved to /var/cache/conftool/dbconfig/20241105-165103-ladsgroup.json [16:51:39] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [16:54:55] (03CR) 10Ssingh: [C:03+1] prometheus::ops: Scrape liberica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1087517 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [16:55:18] (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Scrape liberica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1087517 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [16:57:46] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [16:58:05] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [16:58:57] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10293477 (10thcipriani) >>! 
In T379044#10291795, @Joe wrote: > and something in eval.php tries to log the call. I guess it's something... [16:59:17] (03CR) 10CDanis: [C:03+2] service catalog: add chart-renderer in service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1087515 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:03:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10293524 (10Jhancock.wm) [17:06:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T376905)', diff saved to https://phabricator.wikimedia.org/P70941 and previous config saved to /var/cache/conftool/dbconfig/20241105-170609-ladsgroup.json [17:06:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [17:06:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [17:06:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1028 (T376905)', diff saved to https://phabricator.wikimedia.org/P70942 and previous config saved to /var/cache/conftool/dbconfig/20241105-170636-ladsgroup.json [17:11:15] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q2): Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#10293583 (10lmata) [17:11:21] 10SRE-swift-storage, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q2): Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#10293584 (10lmata) [17:12:16] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q2): 
Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10293597 (10lmata) [17:13:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T376905)', diff saved to https://phabricator.wikimedia.org/P70943 and previous config saved to /var/cache/conftool/dbconfig/20241105-171330-ladsgroup.json [17:13:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10293609 (10Jhancock.wm) [17:15:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10293622 (10Jhancock.wm) @bking we got these in. Please update the site.pp file. We should be able to get these to you by End of... [17:16:03] 06SRE, 06cloud-services-team, 10Cloud-VPS, 10observability, and 3 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10293600 (10lmata) [17:16:41] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10293606 (10Jhancock.wm) @Marostegui this should make it diverse. lmk if you want something different. es2043 > C7 es2045 > A7 es2046 > B7 @ABran-WMF w... 
[17:26:23] (03CR) 10David Caro: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [17:28:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P70945 and previous config saved to /var/cache/conftool/dbconfig/20241105-172837-ladsgroup.json [17:28:52] (03PS1) 10CDanis: chart-renderer: enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087527 (https://phabricator.wikimedia.org/T372081) [17:29:11] (03PS1) 10Hnowlan: rest-gateway: order mw-api-int paths strictly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) [17:30:18] (03CR) 10Kamila Součková: [C:03+1] chart-renderer: enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087527 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:31:01] (03CR) 10Alexandros Kosiaris: "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [17:31:02] (03CR) 10Alexandros Kosiaris: [C:03+2] rest-gateway: order mw-api-int paths strictly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [17:31:02] (03CR) 10CDanis: [C:03+2] chart-renderer: enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087527 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:32:02] (03Merged) 10jenkins-bot: rest-gateway: order mw-api-int paths strictly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [17:32:27] (03Merged) 10jenkins-bot: chart-renderer: enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087527 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:32:30] (03PS1) 10FNegri: alertmanager: simplify WMCS templates [puppet] - 10https://gerrit.wikimedia.org/r/1087531 (https://phabricator.wikimedia.org/T375479) [17:32:43] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [17:32:59] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:33:10] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:33:20] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:34:36] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:34:51] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:35:10] (03PS1) 10CDanis: fix vscode file extension mangling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087532 (https://phabricator.wikimedia.org/T372081) [17:36:03] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:36:04] (03PS2) 
10Eevans: replace list of cassandra hosts with faux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 [17:36:12] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:37:40] (03CR) 10CDanis: [C:03+2] fix vscode file extension mangling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087532 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:37:53] (03CR) 10Eevans: "Thanks @hnowlan@wikimedia.org; Thanks @swfrench@wikimedia.org!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [17:38:31] (03Abandoned) 10JHathaway: ms-be: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087505 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [17:38:49] (03Merged) 10jenkins-bot: fix vscode file extension mangling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087532 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:39:20] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [17:39:29] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:40:29] PROBLEM - mysqld processes on pc1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:40:29] PROBLEM - MariaDB Replica SQL: pc5 on pc1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:40:29] PROBLEM - MariaDB Replica IO: pc5 on pc1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:40:29] PROBLEM - MariaDB Replica Lag: pc5 on pc1017 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:40:31] PROBLEM - MariaDB read only pc5 on pc1017 is CRITICAL: Could not 
connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:40:31] PROBLEM - MariaDB Event Scheduler pc5 on pc1017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:41:40] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [17:41:56] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [17:41:59] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [17:42:16] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [17:43:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P70946 and previous config saved to /var/cache/conftool/dbconfig/20241105-174344-ladsgroup.json [17:44:09] uhhh is that pc1017 alert something to worry about? [17:44:18] ah, just reimaged [17:51:01] (03PS1) 10CDanis: chart-renderer: to production [puppet] - 10https://gerrit.wikimedia.org/r/1087536 (https://phabricator.wikimedia.org/T372081) [17:51:11] (03CR) 10CI reject: [V:04-1] chart-renderer: to production [puppet] - 10https://gerrit.wikimedia.org/r/1087536 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:51:14] (03PS2) 10CDanis: chart-renderer: to production [puppet] - 10https://gerrit.wikimedia.org/r/1087536 (https://phabricator.wikimedia.org/T372081) [17:55:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:55:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:55:50] (03CR) 10Kamila Součková: [C:03+1] chart-renderer: to production [puppet] - 10https://gerrit.wikimedia.org/r/1087536 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:55:59] (03CR) 10CDanis: [C:03+2] chart-renderer: to production [puppet] - 10https://gerrit.wikimedia.org/r/1087536 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [17:57:35] (03PS1) 10JHathaway: ms-be-simple: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087538 (https://phabricator.wikimedia.org/T371400) [17:57:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [17:58:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T376905)', diff saved to https://phabricator.wikimedia.org/P70947 and previous config saved to /var/cache/conftool/dbconfig/20241105-175851-ladsgroup.json [17:59:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [17:59:53] (03PS1) 10Esanders: Deploy DiscussionTools visual enhancements to top 10 wikis (exc. 
enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087539 (https://phabricator.wikimedia.org/T379102) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1800) [18:00:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [18:00:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1021 (T376905)', diff saved to https://phabricator.wikimedia.org/P70948 and previous config saved to /var/cache/conftool/dbconfig/20241105-180013-ladsgroup.json [18:02:55] RESOLVED: MaxConntrack: Max conntrack at 92.12% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:04:38] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Hopefully one less annoying manual thing." [puppet] - 10https://gerrit.wikimedia.org/r/1087412 (owner: 10Muehlenhoff) [18:06:34] (03CR) 10Scott French: [C:03+1] replace list of cassandra hosts with faux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [18:06:38] (03CR) 10Majavah: Add a helper script to setup the Ganeti LVM vg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087412 (owner: 10Muehlenhoff) [18:06:47] (03CR) 10MVernon: [C:03+1] "LGTM, thanks, although I am a very long way from a partman expert!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087538 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [18:10:52] !log gradual delete of thumbs in fawiki local images in both dcs [18:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:25] (03PS3) 10Scott French: changeprop: add per-rule consumer properties in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087542 (https://phabricator.wikimedia.org/T356241) [18:22:07] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/f1609191c411ecee6097dae25efe86b565ec8bcc80fe892af9879c08eb96a590/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [18:23:29] (03CR) 10Hnowlan: [C:03+1] "Nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087542 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [18:24:31] (03PS4) 10Anzx: cswiki: adding throttle rule for Editathon Czechoslovakia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087540 (https://phabricator.wikimedia.org/T379060) [18:24:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087540 (https://phabricator.wikimedia.org/T379060) (owner: 10Anzx) [18:28:25] (03PS5) 10Anzx: cswiki: adding throttle rule for Editathon Czechoslovakia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087540 (https://phabricator.wikimedia.org/T379060) [18:40:25] (03PS4) 10Muehlenhoff: Add a helper script to setup the Ganeti LVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 [18:41:15] (03CR) 10Muehlenhoff: Add a helper script to setup the Ganeti LVM vg (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1087412 (owner: 10Muehlenhoff) [18:42:07] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [18:43:05] (03PS2) 10Dzahn: admin: add group approver for ldap-admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) [18:45:45] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:46:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 74428224 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:47:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6898296 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:48:20] (03PS2) 10Muehlenhoff: admin: add group approver for ldap-admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [18:48:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [19:00:05] jnuche and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T1900) [19:21:10] (03PS2) 10Dzahn: admin: add group approver for ldap-admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) [19:21:19] (03PS3) 10Dzahn: admin: add group approver for ldap-admins [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) [19:21:35] (03CR) 10Dzahn: [C:03+2] admin: add group approver for ldap-admins [puppet] - 10https://gerrit.wikimedia.org/r/1085435 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [19:36:22] (03PS1) 10Scott French: changeprop-jobqueue: update to 2024-11-05-170900-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087557 (https://phabricator.wikimedia.org/T356241) [19:36:24] (03PS1) 10Scott French: changeprop-jobqueue: set max poll interval and revert concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087558 (https://phabricator.wikimedia.org/T356241) [19:39:26] (03PS1) 10Urbanecm: AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087560 (https://phabricator.wikimedia.org/T379094) [19:39:36] (03PS1) 10Urbanecm: AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087561 (https://phabricator.wikimedia.org/T379094) [19:40:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10294144 (10cmooney) >>! In T377381#10274577, @Jclark-ctr wrote: > @cmooney fyi i have 10x of the 100g green handled optics J... 
[19:52:33] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [19:52:33] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [19:56:36] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [19:56:37] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [19:57:50] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [19:57:52] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:02:30] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw2-c1a-eqiad - cmooney@cumin1002" [20:02:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw2-c1a-eqiad - cmooney@cumin1002" [20:02:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:17] (03PS1) 10CDanis: services_proxy: add chart-renderer & enable for MW [puppet] - 10https://gerrit.wikimedia.org/r/1087565 (https://phabricator.wikimedia.org/T372081) [20:07:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:07:34] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device fasw2-c1b-eqiad.mgmt.eqiad.wmnet [20:07:36] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:14:00] !log 
cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw2-c1b-eqiad - cmooney@cumin1002" [20:14:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw2-c1b-eqiad - cmooney@cumin1002" [20:14:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:48:39] (03CR) 10Scott French: [C:03+1] services_proxy: add chart-renderer & enable for MW [puppet] - 10https://gerrit.wikimedia.org/r/1087565 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [20:53:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087560 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [20:53:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087561 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [20:54:37] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485#10294334 (10cmooney) I used the cookbook to provision the two new frack switches in eqiad this evening. Mostly it worked ok,... 
[20:55:46] (03CR) 10Urbanecm: [C:03+2] AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087560 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [20:55:47] (03CR) 10Urbanecm: [C:03+2] AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087561 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [20:56:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:56:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw2-c1a-eqiad.mgmt.eqiad.wmnet [20:58:19] PROBLEM - Host ganeti1041 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T2100). Please do the needful. [21:00:05] anzx and urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:25] i can deploy today [21:00:28] anzx: hi! 
[21:00:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw2-c1b-eqiad.mgmt.eqiad.wmnet [21:00:47] RECOVERY - Host ganeti1041 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [21:00:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087540 (https://phabricator.wikimedia.org/T379060) (owner: 10Anzx) [21:00:55] i'll go ahead given it is a throttle rule, and there's nothing to do anyway [21:01:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:01:37] (03Merged) 10jenkins-bot: cswiki: adding throttle rule for Editathon Czechoslovakia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087540 (https://phabricator.wikimedia.org/T379060) (owner: 10Anzx) [21:02:08] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087540|cswiki: adding throttle rule for Editathon Czechoslovakia (T379060)]] [21:02:19] T379060: Lift IP cap on 2024-11-06 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T379060 [21:02:29] (03PS1) 10Cathal Mooney: Configure cumin ssh to use network settings for fasw switches [puppet] - 10https://gerrit.wikimedia.org/r/1087572 (https://phabricator.wikimedia.org/T336485) [21:03:09] PROBLEM - Host ganeti1041 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:47] RECOVERY - Host ganeti1041 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [21:06:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:06:48] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T379116 (10phaultfinder) 03NEW [21:11:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy 
GRACEFUL_RESTART [21:13:52] (03Merged) 10jenkins-bot: AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087560 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [21:13:53] (03Merged) 10jenkins-bot: AbstractProvider: Normalize top level config correctly [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087561 (https://phabricator.wikimedia.org/T379094) (owner: 10Urbanecm) [21:14:10] still in progress... [21:17:28] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#10294415 (10Dzahn) [21:22:00] (03PS1) 10Dzahn: admin: add group approvers for druid-admins, htmldumps-admin, udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) [21:25:42] (03CR) 10Dzahn: "While doing this I was also thinking about adding a new type of group to this yaml. "group of approvers", so we could use "*data_platform_" [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [21:27:33] (03CR) 10CDanis: [C:03+2] services_proxy: add chart-renderer & enable for MW [puppet] - 10https://gerrit.wikimedia.org/r/1087565 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [21:28:31] still deploying the throttle patches... 
[21:33:10] jouncebot: nowandnext [21:33:10] For the next 0 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241105T2100) [21:33:10] In 9 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T0700) [21:33:26] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087540|cswiki: adding throttle rule for Editathon Czechoslovakia (T379060)]] (duration: 31m 18s) [21:33:29] T379060: Lift IP cap on 2024-11-06 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T379060 [21:33:33] finally [21:33:36] cdanis: still deploying [21:33:44] 31minutes for a simple config change was unexpected [21:33:47] :| [21:33:51] what step took so long? [21:34:23] all of them, kind of [21:34:31] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1087560|AbstractProvider: Normalize top level config correctly (T379094)]], [[gerrit:1087561|AbstractProvider: Normalize top level config correctly (T379094)]] [21:34:33] T379094: TypeError: Argument 1 passed to MediaWiki\Extension\CommunityConfiguration\Provider\AbstractProvider::normalizeTopLevelConfigData() must be an instance of stdClass, array given, called in /srv/mediawiki/php-1.44.0-wmf.1/extensi - https://phabricator.wikimedia.org/T379094 [21:35:19] longest two on scap record: build-and-push-container-images (duration: 11m 57s) , sync-testservers-k8s (duration: 05m 02s) [21:35:23] but that doesn't add up to 31 minutes... [21:35:29] yeah... [21:36:45] complete traceback https://www.irccloud.com/pastebin/MaBCwDHA/ [21:36:47] cdanis: ^^ [21:39:13] urbanecm: can you provide the contents of /home/urbanecm/scap-image-build-and-push-log as well? 
i just merged some changes into the release project that performs php8.1 builds and it's possible that your backport kicked off the first of that build [21:39:51] dduvall: not unless that file gets archived somewhere i started another scap by now [21:40:02] dduvall: yeah, I was just about to comment - looks like this pulled in the release scripts changes that were merged [21:41:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10294508 (10bking) @Jhancock.wm I think the hosts are [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084253/1/manifests/... [21:41:39] urbanecm: ok. no problem. it's most likely the full image build that took a long time, but we'll definitely investigate if it persists [21:41:49] sorry, i should have mentioned that here prior to the window [21:42:07] ack. let's see how this one goes :) [21:42:18] :) [21:42:34] dduvall: should we start auto-archiving those logs? maybe when a build-and-push step takes longer than N minutes [21:43:12] i believe the plan is to integrate build-images.py into scap soon. at that point the logs will be unified [21:43:16] ack [21:43:24] ... docker buildkit can emit opentelemetry? [21:43:52] or maybe do an empty backport whenever a change is merged, so that it doesn't impact random deployers? [21:44:08] it'd be super inconvenient were i to do a rollback that i need to get out ASAP [21:44:12] cdanis: buildkitd can [21:44:41] we could hook that up to jaeger if we wanted to [21:44:56] interesting... [21:45:09] urbanecm: yeah I think +1 on that from me, ideally at least build, push, and maybe pre-pull the images in prod [21:45:48] urbanecm: the incremental image builds are a lot faster. 
it's just that your backport happened to kick off the very first php8.1 based image which is a full rebuild [21:45:52] _was_ a full rebuild [21:46:05] from here on the incremental optimizations _should_ work correctly [21:46:17] still can hit a patch that needs to go out ASAP though :) [21:46:23] yeah very true [21:47:10] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087560|AbstractProvider: Normalize top level config correctly (T379094)]], [[gerrit:1087561|AbstractProvider: Normalize top level config correctly (T379094)]] (duration: 12m 39s) [21:47:16] there's a task open that tracks an idea for preemptively building images based on current backports [21:47:21] T379094: TypeError: Argument 1 passed to MediaWiki\Extension\CommunityConfiguration\Provider\AbstractProvider::normalizeTopLevelConfigData() must be an instance of stdClass, array given, called in /srv/mediawiki/php-1.44.0-wmf.1/extensi - https://phabricator.wikimedia.org/T379094 [21:47:21] * dduvall finds [21:47:24] well, this one was lot quicker [21:47:29] oh good! [21:47:33] dduvall: let me know if you want me to capture this idea somewhere [21:48:52] urbanecm: take a look at https://phabricator.wikimedia.org/T378428 and see if it captures some of the idea, and please add your idea/thoughts [21:49:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10294542 (10KFrancis) Hi all, confirming the NDA has been signed. Thanks! 
[21:49:32] there might have also been some amount of churn for the existing 7.4-based images as well, going by how long the various updates took (e.g., sync-testservers-k8s took 5m) [21:49:43] ty dduvall [21:49:48] just guessing, as those delays are usually the image pull [21:49:51] np [21:49:53] swfrench-wmf: ah yeah [21:50:07] 12 mins is still longer than 7.5min earlier today [21:50:37] but at least not as significant [21:50:50] urbanecm: do you have the image build log of this one still? [21:50:54] i would love to have a look [21:50:57] yep [21:51:06] dduvall: is it ok to send them in public? [21:51:33] the image builds are now done in parallel. i was hoping it would be a net optimization, but there is a whole additional mw image build in there too [21:51:42] urbanecm: good question. i'm not 100% on that [21:51:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294547 (10Jhancock.wm) [21:51:57] I can copy them between homedirs on deploy200x if needed [21:52:09] cdanis: urbanecm that works for me [21:53:05] swfrench-wmf: optimizations aside, i'm glad it's working :D [21:53:21] cdanis: the file's 664 anyway, pretty much anyone can do anything to it [21:53:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:53:36] dduvall: yeah, I was just about to say - hooray and thank you :) [21:53:58] urbanecm: copied it. ty! 
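[editor's note] The timing question raised above (two reported steps of 11m 57s and 05m 02s against a 31m 18s sync-world run) can be checked with a few lines of Python. This is only a sketch: it assumes step timings appear in the "(duration: Nm Ns)" form seen in the log lines, and the helper names duration_seconds and longest_steps are hypothetical, not part of scap.

```python
import re

# Matches the "(duration: 11m 57s)" form that appears in the log lines above.
DURATION_RE = re.compile(r"\(duration: (\d+)m (\d+)s\)")


def duration_seconds(text):
    """Return the duration in seconds found in text, or None if absent."""
    match = DURATION_RE.search(text)
    if match is None:
        return None
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def longest_steps(steps, top=5):
    """Rank (name, text) pairs by parsed duration, longest first."""
    timed = [(name, duration_seconds(text)) for name, text in steps]
    timed = [(name, secs) for name, secs in timed if secs is not None]
    return sorted(timed, key=lambda item: item[1], reverse=True)[:top]


# Figures quoted in the channel for the 31m 18s sync-world run.
steps = [
    ("build-and-push-container-images", "(duration: 11m 57s)"),
    ("sync-testservers-k8s", "(duration: 05m 02s)"),
]
total = duration_seconds("(duration: 31m 18s)")
accounted = sum(secs for _, secs in longest_steps(steps))

print(longest_steps(steps))
print("unaccounted seconds:", total - accounted)
```

With the figures quoted in the channel, the two named steps cover 1019 of 1878 seconds, leaving 859 seconds (about 14 minutes) unaccounted for, which matches the "doesn't add up to 31 minutes" remark.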
[21:54:01] we now indeed have 8.1 images published [21:54:10] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10294549 (10Jclark-ctr) This is still under warranty @VRiley-WMF If you want to reopen the ticket with Dell ` 2024-11-05T21:25:01.773142+00:00 db1246 rsyslog... [21:54:15] swfrench-wmf: yw and thanks for the reviews [21:54:34] (and the scap changes :D) [21:55:16] dduvall: https://phabricator.wikimedia.org/P70951 if you want a more permanent link (it also has the scap terminal log in case you need it too) [21:55:24] posted as a NDA paste, should be good enough [21:55:48] urbanecm: thanks! [21:55:58] seems like the image builds were pretty fast this time actually [21:56:46] urbanecm: do you still have the scap output as well? i'm curious to see how the image build log timestamps and that task timing compare [21:57:01] dduvall: that should be on the paste? [21:57:06] *the image build task* timings compare i mean [21:57:19] urbanecm: ah, yes indeed [21:57:37] sorry, looked at the one on disk first [21:57:44] (03PS1) 10Tchanders: temp accounts: Enable IP reveal rights for local groups on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087584 (https://phabricator.wikimedia.org/T356294) [21:58:09] i don't have timing from the short one, cleared my terminal logs in the last...9 hours or so [21:58:30] np [21:59:07] would be great if scap finished with a summary of the top 5 longest timed tasks or similar [22:00:21] yep. or logged everything to /var/log/scap or something [22:09:20] scap does log to logstash [22:10:04] https://logstash.wikimedia.org/goto/1f3d40b27eae2e66cc9e73c8386af9d2 the dashboard is challenging, even by kibana standards [22:10:07] but the data is there [22:10:42] anyway I have to go afk now but sometime soon dduvall let's explore exporting whatever otel trace data we can from buildx/buildkit [22:12:43] ooo, cool! 
til [22:25:42] (03CR) 10Peter Fischer: [C:03+1] WIP: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson) [22:26:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:26:24] cdanis: sounds good! [22:29:44] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2132 [22:29:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2130 to codfw - jhancock@cumin2002" [22:29:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2130 to codfw - jhancock@cumin2002" [22:29:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:30:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2130 [22:30:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2131 [22:30:11] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2132 [22:30:12] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2133 [22:30:14] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2134 [22:30:15] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2135 [22:30:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2130 [22:30:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2131 [22:30:25] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2132 [22:30:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2133 [22:30:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2135 [22:30:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2134 [22:31:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2130.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2131.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2132.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2133.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2134.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2135.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2131.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:24] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host wikikube-worker2133.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2130.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2132.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2134.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2135.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:52:13] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2130'] [22:52:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2131'] [22:52:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2132'] [22:52:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2133'] [22:52:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2134'] [22:52:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2135'] [22:52:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2130'] [22:53:00] !log jhancock@cumin2002 
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2131'] [22:53:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2132'] [22:53:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2133'] [22:53:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2134'] [22:53:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2135'] [22:54:06] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10294725 (10eoghan) >>! In T377045#10287153, @Platonides wrote: > The bug for multiple mailing lists was fixed several years ago: https://gitlab... 
[22:58:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2130.codfw.wmnet with OS bookworm [22:58:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2131.codfw.wmnet with OS bookworm [22:58:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2132.codfw.wmnet with OS bookworm [22:58:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2133.codfw.wmnet with OS bookworm [22:58:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294729 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2130.codfw.wmnet with OS bookworm [22:58:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2131.codfw.wmnet with OS bookworm [22:58:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2132.codfw.wmnet with OS bookworm [22:58:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294732 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2133.codfw.wmnet with OS bookworm [23:00:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2134.codfw.wmnet with OS bookworm [23:00:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2135.codfw.wmnet with OS bookworm 
[23:01:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2134.codfw.wmnet with OS bookworm [23:01:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294737 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2135.codfw.wmnet with OS bookworm [23:04:55] (03PS1) 10Arlolra: Revert "Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway"" [puppet] - 10https://gerrit.wikimedia.org/r/1087591 (https://phabricator.wikimedia.org/T374683) [23:09:27] (03CR) 10Subramanya Sastry: [C:03+1] Revert "Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway"" [puppet] - 10https://gerrit.wikimedia.org/r/1087591 (https://phabricator.wikimedia.org/T374683) (owner: 10Arlolra) [23:10:11] (03PS1) 10Ebernhardson: TextPassDumper: refresh content address on failure [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087592 (https://phabricator.wikimedia.org/T377594) [23:10:34] (03PS1) 10Ebernhardson: TextPassDumper: refresh content address on failure [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087593 (https://phabricator.wikimedia.org/T377594) [23:11:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:16:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2133.codfw.wmnet with reason: host reimage [23:16:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2130.codfw.wmnet with reason: host reimage [23:16:58] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker2131.codfw.wmnet with reason: host reimage [23:17:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2132.codfw.wmnet with reason: host reimage [23:17:51] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:17:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1087592 (https://phabricator.wikimedia.org/T377594) (owner: 10Ebernhardson) [23:17:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087593 (https://phabricator.wikimedia.org/T377594) (owner: 10Ebernhardson) [23:18:27] (03CR) 10Ladsgroup: [C:03+2] Revert "Revert^2 "ats: Route rest_v1/page/(html|title) to rest-gateway"" [puppet] - 10https://gerrit.wikimedia.org/r/1087591 (https://phabricator.wikimedia.org/T374683) (owner: 10Arlolra) [23:18:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2134.codfw.wmnet with reason: host reimage [23:18:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2135.codfw.wmnet with reason: host reimage [23:19:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2133.codfw.wmnet with reason: host reimage [23:23:01] (03CR) 10Urbanecm: [C:04-1] "per conversation in Slack, Global Contributons should be more clear about how permissions work there, especially for users with local acce" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087584 (https://phabricator.wikimedia.org/T356294) (owner: 10Tchanders) [23:23:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on wikikube-worker2130.codfw.wmnet with reason: host reimage [23:26:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2135.codfw.wmnet with reason: host reimage [23:30:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2131.codfw.wmnet with reason: host reimage [23:33:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2132.codfw.wmnet with reason: host reimage [23:38:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2134.codfw.wmnet with reason: host reimage [23:39:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:42:34] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10294794 (Jclark-ctr) I have gone through and rerun all provision scripts on these i believe this is good to close @elukey [23:42:51] (PS1) Ryan Kemper: Revert "Pause the XML/SQL dumps due to potential data quality issues" [puppet] - https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) [23:43:30] (CR) CI reject: [V:-1] Revert "Pause the XML/SQL dumps due to potential data quality issues" [puppet] - https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) (owner: Ryan Kemper) [23:44:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:50:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" 
[23:53:04] (Merged) jenkins-bot: TextPassDumper: refresh content address on failure [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1087592 (https://phabricator.wikimedia.org/T377594) (owner: Ebernhardson) [23:53:09] (Merged) jenkins-bot: TextPassDumper: refresh content address on failure [core] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1087593 (https://phabricator.wikimedia.org/T377594) (owner: Ebernhardson) [23:53:42] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:1087593|TextPassDumper: refresh content address on failure (T377594)]] [23:53:54] T377594: Fix Dumps - errors exporting good revisions - https://phabricator.wikimedia.org/T377594 [23:54:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:54:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2131.codfw.wmnet with OS bookworm [23:54:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:54:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2133.codfw.wmnet with OS bookworm [23:54:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:54:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2130.codfw.wmnet with OS bookworm [23:55:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:55:27] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294843 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2131.codfw.wmnet with OS bookworm completed: - wi... [23:55:28] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294844 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2133.codfw.wmnet with OS bookworm completed: - wi... [23:55:30] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294845 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2130.codfw.wmnet with OS bookworm completed: - wi... 
[23:56:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:56:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2132.codfw.wmnet with OS bookworm [23:56:37] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:1087593|TextPassDumper: refresh content address on failure (T377594)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:57:03] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:57:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:57:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2135.codfw.wmnet with OS bookworm [23:57:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:57:39] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [23:58:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:58:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2134.codfw.wmnet with OS bookworm [23:58:54] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294848 (ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2132.codfw.wmnet with OS bookworm completed: - wi... [23:58:58] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294850 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2135.codfw.wmnet with OS bookworm completed: - wi... [23:59:00] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294851 (Jhancock.wm) [23:59:02] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294852 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2134.codfw.wmnet with OS bookworm completed: - wi... [23:59:04] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10294853 (Jhancock.wm) [23:59:17] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool es1021 gradually with 4 steps - Maint over