[00:00:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P70444 and previous config saved to /var/cache/conftool/dbconfig/20241022-000032-ladsgroup.json
[00:05:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082109
[00:08:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082109 (owner: TrainBranchBot)
[00:10:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1082106 (owner: TrainBranchBot)
[00:15:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70445 and previous config saved to /var/cache/conftool/dbconfig/20241022-001539-ladsgroup.json
[00:15:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: Maintenance
[00:15:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: Maintenance
[00:16:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2229 (T376905)', diff saved to https://phabricator.wikimedia.org/P70446 and previous config saved to /var/cache/conftool/dbconfig/20241022-001606-ladsgroup.json
[00:23:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T376905)', diff saved to https://phabricator.wikimedia.org/P70447 and previous config saved to /var/cache/conftool/dbconfig/20241022-002259-ladsgroup.json
[00:38:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P70448 and previous config saved to /var/cache/conftool/dbconfig/20241022-003807-ladsgroup.json
[00:40:39] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082109 (owner: TrainBranchBot)
[00:46:23] (CR) Ssingh: varnish: Give 1% of views RSA cert warnings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[00:53:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P70449 and previous config saved to /var/cache/conftool/dbconfig/20241022-005313-ladsgroup.json
[01:05:55] (PS2) RLazarus: deployment_server: Add JSON output mode to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1081986 (https://phabricator.wikimedia.org/T377292)
[01:08:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T376905)', diff saved to https://phabricator.wikimedia.org/P70450 and previous config saved to /var/cache/conftool/dbconfig/20241022-010820-ladsgroup.json
[01:08:29] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.28 [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082114 (https://phabricator.wikimedia.org/T375659)
[01:08:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.28 [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082114 (https://phabricator.wikimedia.org/T375659) (owner: TrainBranchBot)
[01:09:26] (CR) RLazarus: [C:+2] deployment_server: mwscript-k8s logging cleanups [puppet] - https://gerrit.wikimedia.org/r/1081265 (https://phabricator.wikimedia.org/T377292) (owner: RLazarus)
[01:09:32] (CR) RLazarus: [C:+2] deployment_server: Refactor mwscript-k8s preparatory to adding --output (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1081985 (https://phabricator.wikimedia.org/T377292) (owner: RLazarus)
[01:09:42] (CR) RLazarus: [C:+2] "Thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/1081986 (https://phabricator.wikimedia.org/T377292) (owner: RLazarus)
[01:43:40] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.28 [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082114 (https://phabricator.wikimedia.org/T375659) (owner: TrainBranchBot)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0200)
[02:01:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:57:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0300)
[03:01:39] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/1082122 (https://phabricator.wikimedia.org/T375659)
[03:01:41] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/1082122 (https://phabricator.wikimedia.org/T375659) (owner: TrainBranchBot)
[03:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:24] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/1082122 (https://phabricator.wikimedia.org/T375659) (owner: TrainBranchBot)
[03:02:51] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.43.0-wmf.28 refs T375659
[03:03:19] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659
[03:05:25] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:28] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.43.0-wmf.28 refs T375659 (duration: 49m 37s)
[03:52:44] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0400)
[04:01:00] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.25 (duration: 00m 58s)
[05:24:27] Deploying Cxserver. Some major changes!
[05:24:39] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:25:13] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:30:24] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:31:01] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:41:41] !log Remove servicerunner dependency for cxserver (T357950, T373777)
[05:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:47] T357950: Remove servicerunner dependency for cxserver - https://phabricator.wikimedia.org/T357950
[05:41:47] T373777: Update production config for servicerunner dependency removal for cxserver - https://phabricator.wikimedia.org/T373777
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0600)
[06:00:05] marostegui, Amir1, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0600).
[06:01:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:51] (CR) Jelto: [C:+1] "lgtm, feel free to self merge and deploy or let me know if you need any help" [deployment-charts] - https://gerrit.wikimedia.org/r/1082089 (https://phabricator.wikimedia.org/T377168) (owner: SBassett)
[06:17:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:22:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:41:27] I'm not able to edit grafana dashboard (ie cxserver in this case). Anyone facing same issue?
[06:43:34] kart_: looks ok to me
[06:50:55] I'm getting JS errors (eg. TypeError: a.filter is not a function)
[06:51:14] (PS1) Muehlenhoff: Point irc.w.o to irc1003, take two [dns] - https://gerrit.wikimedia.org/r/1082129 (https://phabricator.wikimedia.org/T376014)
[06:54:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[06:55:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70451 and previous config saved to /var/cache/conftool/dbconfig/20241022-065513-arnaudb.json
[06:56:11] (CR) Slyngshede: [C:+1] "Looks good" [dns] - https://gerrit.wikimedia.org/r/1082129 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[06:56:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6003.wikimedia.org
[06:56:47] fabfur: seems OK now :)
[06:57:43] ack!
[06:59:18] SRE, decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10249103 (MoritzMuehlenhoff)
[07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org
[07:02:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6003.wikimedia.org
[07:06:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:10:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 26%: post clone', diff saved to https://phabricator.wikimedia.org/P70452 and previous config saved to /var/cache/conftool/dbconfig/20241022-071018-arnaudb.json
[07:23:01] !log rearm keyholder on netmon1003
[07:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 27%: post clone', diff saved to https://phabricator.wikimedia.org/P70453 and previous config saved to /var/cache/conftool/dbconfig/20241022-072523-arnaudb.json
[07:26:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:28:48] !log installing Java 17 security updates
[07:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:28] (CR) Alexandros Kosiaris: [C:+2] "The mistake here is mine. I did run sre.dns.netbox after merging this and the dependent change, got an error I didn't expect (turns out I " [dns] - https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: Alexandros Kosiaris)
[07:36:07] (PS2) Slyngshede: Allow CAS to have Redis supported enabled in overlay. [puppet] - https://gerrit.wikimedia.org/r/1081980 (https://phabricator.wikimedia.org/T377728)
[07:36:56] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4329/co" [puppet] - https://gerrit.wikimedia.org/r/1081980 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[07:40:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 28%: post clone', diff saved to https://phabricator.wikimedia.org/P70454 and previous config saved to /var/cache/conftool/dbconfig/20241022-074029-arnaudb.json
[07:46:33] (PS1) Slyngshede: Upgrade, v7.0.9, and enable Redis ticket registry. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1082150 (https://phabricator.wikimedia.org/T377728)
[07:47:08] (CR) Elukey: [C:+1] Point irc.w.o to irc1003, take two [dns] - https://gerrit.wikimedia.org/r/1082129 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[07:50:36] (CR) Muehlenhoff: [C:+2] Point irc.w.o to irc1003, take two [dns] - https://gerrit.wikimedia.org/r/1082129 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[07:55:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 50%: post clone', diff saved to https://phabricator.wikimedia.org/P70455 and previous config saved to /var/cache/conftool/dbconfig/20241022-075534-arnaudb.json
[07:58:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T377718', diff saved to https://phabricator.wikimedia.org/P70456 and previous config saved to /var/cache/conftool/dbconfig/20241022-075830-arnaudb.json
[07:58:34] T377718: db2205 and db2207 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718
[07:58:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:59:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db[2149,2205].codfw.wmnet with reason: db2205 reclone
[07:59:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2149,2205].codfw.wmnet with reason: db2205 reclone
[08:00:15] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:03:24] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:03:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:04:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org
[08:05:25] (CR) Lucas Werkmeister (WMDE): "I’m a bit confused, what’s the difference between the “top-level” `dblist` and `databases[].dblist`? Is one just a shortcut for the other?" [puppet] - https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[08:06:18] (PS1) Brouberol: datahub: add metadata topics env vars to the gms and mae-consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753)
[08:08:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org
[08:10:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 75%: post clone', diff saved to https://phabricator.wikimedia.org/P70457 and previous config saved to /var/cache/conftool/dbconfig/20241022-081040-arnaudb.json
[08:15:45] (PS1) Muehlenhoff: Drop the ircstream CNAME [dns] - https://gerrit.wikimedia.org/r/1082154 (https://phabricator.wikimedia.org/T376014)
[08:16:24] (CR) Elukey: [C:+1] Drop the ircstream CNAME [dns] - https://gerrit.wikimedia.org/r/1082154 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[08:24:41] !log irc.wikimedia.org has been switched to ircstream T376014
[08:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:53] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014
[08:25:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 100%: post clone', diff saved to https://phabricator.wikimedia.org/P70459 and previous config saved to /var/cache/conftool/dbconfig/20241022-082545-arnaudb.json
[08:32:52] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:33:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:33:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:33:49] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:34:55] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:35:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:37:22] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2205.codfw.wmnet
[08:38:51] (CR) AOkoth: [C:+1] Update miscweb: security-landing-page to latest image tag [deployment-charts] - https://gerrit.wikimedia.org/r/1082089 (https://phabricator.wikimedia.org/T377168) (owner: SBassett)
[08:46:43] (PS1) Lucas Werkmeister (WMDE): tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082160 (https://phabricator.wikimedia.org/T377778)
[08:48:57] (CR) Btullis: [C:+1] "Great, many thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[08:51:06] (PS1) Gmodena: charts: airflow: alert only on task failure [deployment-charts] - https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745)
[08:56:39] (PS2) Brouberol: datahub: add metadata topics env vars to the gms and mae-consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753)
[08:57:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[08:57:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[09:00:09] (CR) Btullis: datahub: add metadata topics env vars to the gms and mae-consumer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[09:02:51] (CR) CI reject: [V:-1] tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082160 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE))
[09:04:56] (Abandoned) Lucas Werkmeister (WMDE): tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082160 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE))
[09:05:03] (PS3) Brouberol: datahub: add metadata topics env vars to the gms and mae-consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753)
[09:06:16] !log Restarting Gerrit
[09:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:03] Consumed 2w 21h 25min 19.244s CPU time.
[09:16:43] (CR) Btullis: [C:+1] datahub: add metadata topics env vars to the gms and mae-consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[09:16:53] (CR) Brouberol: [C:+2] datahub: add metadata topics env vars to the gms and mae-consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[09:17:04] (CR) Btullis: [C:+1] datahub: add metadata topics env vars to the gms and mae-consumer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[09:17:19] (CR) Brouberol: [V:+2 C:+2] datahub: add metadata topics env vars to the gms and mae-consumer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1082151 (https://phabricator.wikimedia.org/T377753) (owner: Brouberol)
[09:18:21] (PS1) Jelto: wikidata-query-gui: move query.wikidata.org into separate values file [deployment-charts] - https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793)
[09:18:23] (PS1) Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793)
[09:21:36] (PS2) Jelto: wikidata-query-gui: move query.wikidata.org into separate values file [deployment-charts] - https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793)
[09:22:08] !log Restarting CI Jenkins
[09:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:12] (PS2) Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793)
[09:26:13] (PS3) Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793)
[09:27:26] (PS1) Brouberol: datahub: fix GMS host in mce consumer config [deployment-charts] - https://gerrit.wikimedia.org/r/1082169
[09:27:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:28:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:28:28] (CR) Btullis: [C:+1] datahub: fix GMS host in mce consumer config [deployment-charts] - https://gerrit.wikimedia.org/r/1082169 (owner: Brouberol)
[09:29:27] (CR) Brouberol: [C:+2] datahub: fix GMS host in mce consumer config [deployment-charts] - https://gerrit.wikimedia.org/r/1082169 (owner: Brouberol)
[09:32:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:33:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[09:33:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[09:33:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70460 and previous config saved to /var/cache/conftool/dbconfig/20241022-093345-ladsgroup.json
[09:34:41] (CR) Hnowlan: [C:+1] modules/admin: Add bd808 to contint-roots and contint-docker groups [puppet] - https://gerrit.wikimedia.org/r/1082105 (https://phabricator.wikimedia.org/T377792) (owner: BryanDavis)
[09:35:00] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10249338 (elukey) Updated the firmware of ms-be208[1-3] and ran the provision cookbook, all good! Procedure that I used: * ssh tunnel to access the BM...
[09:36:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[09:37:16] (CR) Ladsgroup: "Yes. Most cases can use top-level dblist instead." [puppet] - https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[09:38:27] (CR) Elukey: [C:-1] "I wish this was done!" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[09:40:25] (PS1) Jcrespo: mediabackups: Setup new host for mediabackups backup[12]012 [puppet] - https://gerrit.wikimedia.org/r/1082172 (https://phabricator.wikimedia.org/T376892)
[09:41:44] (PS1) Ammarpad: contactpages: Update Affcom UserGroup application form [mediawiki-config] - https://gerrit.wikimedia.org/r/1082174 (https://phabricator.wikimedia.org/T375392)
[09:43:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70461 and previous config saved to /var/cache/conftool/dbconfig/20241022-094322-ladsgroup.json
[09:45:11] SRE, SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10249354 (hnowlan) Could you please specify which groups access is needed to? There are a few dumps groups but it appears that @gmodena should inherit access to all of them by virtue of being part of th...
[09:47:27] (PS1) Brouberol: datahub: don't include the scheme in the GMS host [deployment-charts] - https://gerrit.wikimedia.org/r/1082176
[09:47:35] (CR) Brouberol: [C:+2] datahub: don't include the scheme in the GMS host [deployment-charts] - https://gerrit.wikimedia.org/r/1082176 (owner: Brouberol)
[09:47:57] (PS11) Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738)
[09:51:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:51:46] (CR) Volans: [C:+2] apiclient: add a generic API client module [software/spicerack] - https://gerrit.wikimedia.org/r/1081906 (owner: Volans)
[09:52:12] jouncebot: now
[09:52:12] No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[09:52:15] jouncebot: next
[09:52:16] In 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1000)
[09:56:21] (PS2) Jcrespo: mediabackups: Setup new host for mediabackups backup1012 [puppet] - https://gerrit.wikimedia.org/r/1082172 (https://phabricator.wikimedia.org/T376892)
[09:56:41] SRE, SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10249377 (gmodena) >>! In T377773#10249354, @hnowlan wrote: > Could you please specify which groups access is needed to? There are a few dumps groups but it appears that @gmodena should inherit access t...
[09:58:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P70463 and previous config saved to /var/cache/conftool/dbconfig/20241022-095829-ladsgroup.json
[09:59:10] (CR) Jcrespo: [C:+2] mediabackups: Setup new host for mediabackups backup1012 [puppet] - https://gerrit.wikimedia.org/r/1082172 (https://phabricator.wikimedia.org/T376892) (owner: Jcrespo)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1000)
[10:01:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:01:31] (Merged) jenkins-bot: apiclient: add a generic API client module [software/spicerack] - https://gerrit.wikimedia.org/r/1081906 (owner: Volans)
[10:02:28] (PS2) Brouberol: datahub-frontend: use the same condition than other subcharts to generate GMS_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/1082179
[10:02:29] (CR) Brouberol: [C:+2] datahub-frontend: use the same condition than other subcharts to generate GMS_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/1082179 (owner: Brouberol)
[10:02:50] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@dcf019d]: (no justification provided)
[10:03:00] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@dcf019d]: (no justification provided) (duration: 00m 11s)
[10:03:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2149.codfw.wmnet onto db2205.codfw.wmnet
[10:03:59] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: sync
[10:04:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:04:24] (CR) Volans: "Some improvements inline for the tests and a typo. Almost ready." [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: Arnaudb)
[10:04:36] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: sync
[10:06:03] (CR) Volans: [C:+2] redfish: use the new apiclient module [software/spicerack] - https://gerrit.wikimedia.org/r/1081907 (owner: Volans)
[10:07:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:08:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[10:10:57] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10249440 (cmooney) >>! In T377381#10246886, @Jgreen wrote: >> That's a bit of a shame in some ways but no problems > > We'll...
[10:12:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [10:12:23] (PS1) Lucas Werkmeister (WMDE): tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082180 (https://phabricator.wikimedia.org/T377778) [10:12:35] (PS1) Lucas Werkmeister (WMDE): Update for Message/MessageValue changes [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) [10:13:04] (PS2) Tiziano Fogli: prometheus/cadvisor: lookup extra metrics from hiera [puppet] - https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) [10:13:04] (CR) Tiziano Fogli: "@bking@wikimedia.org: The key in Hiera will be profile::prometheus::cadvisor::metrics_enabled_extra, not profile::prometheus::cadvisor::me" [puppet] - https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: Tiziano Fogli) [10:13:16] (CR) Lucas Werkmeister (WMDE): "As the CI fix for this depends on I2058ca0b9c and vice versa, this will need to be force-merged for deployment." 
[extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082180 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:13:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P70464 and previous config saved to /var/cache/conftool/dbconfig/20241022-101336-ladsgroup.json [10:13:48] (PS10) Arnaudb: mysql_legacy: get systemd status for instance [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) [10:14:21] (CR) Lucas Werkmeister (WMDE): "Note: I’ve dropped the Depends-On from the commit message (compare the version of the change on the master branch), as this commit needs I" [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:15:34] (Merged) jenkins-bot: redfish: use the new apiclient module [software/spicerack] - https://gerrit.wikimedia.org/r/1081907 (owner: Volans) [10:19:42] (CR) Volans: [C:+2] "Docstring only diff since PS1 that was +1ed, merging." 
[software/spicerack] - https://gerrit.wikimedia.org/r/1081908 (owner: Volans) [10:22:08] (CR) CI reject: [V:-1] mysql_legacy: get systemd status for instance [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: Arnaudb) [10:22:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70465 and previous config saved to /var/cache/conftool/dbconfig/20241022-102227-arnaudb.json [10:27:08] (CR) CI reject: [V:-1] Update for Message/MessageValue changes [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:27:51] (PS2) JMeybohm: Migrate wikikube-worker208[5689] to containerd [puppet] - https://gerrit.wikimedia.org/r/1081910 (https://phabricator.wikimedia.org/T362408) [10:27:54] (CR) CI reject: [V:-1] tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082180 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:28:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70466 and previous config saved to /var/cache/conftool/dbconfig/20241022-102843-ladsgroup.json [10:28:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:29:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:29:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T376905)', diff saved to https://phabricator.wikimedia.org/P70467 and previous config saved to 
/var/cache/conftool/dbconfig/20241022-102907-ladsgroup.json [10:29:50] (Merged) jenkins-bot: orchestrator: add a new module [software/spicerack] - https://gerrit.wikimedia.org/r/1081908 (owner: Volans) [10:37:26] SRE, SRE-Unowned, Infrastructure-Foundations, Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10249533 (elukey) >>! In T376014#10227875, @gmodena wrote: >>>! In T376014#10203183, @... [10:37:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: post clone', diff saved to https://phabricator.wikimedia.org/P70468 and previous config saved to /var/cache/conftool/dbconfig/20241022-103733-arnaudb.json [10:38:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T376905)', diff saved to https://phabricator.wikimedia.org/P70469 and previous config saved to /var/cache/conftool/dbconfig/20241022-103822-ladsgroup.json [10:42:55] (CR) Joely Rooke WMDE: [C:-1] "Hi there, thanks for your interest in our change! We saw ruwiki (and many other Cyrillic wikis) already implement moving the link, so hope" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081421 (https://phabricator.wikimedia.org/T66315) (owner: Saint Johann) [10:45:04] (CR) Saint Johann: "This patch was created on the assumption that something prevented the global change from being merged. 
If it will be merged, it is not nec" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081421 (https://phabricator.wikimedia.org/T66315) (owner: Saint Johann) [10:45:36] (Abandoned) Saint Johann: Add Russian Wikipedia to Wikidata link move [mediawiki-config] - https://gerrit.wikimedia.org/r/1081421 (https://phabricator.wikimedia.org/T66315) (owner: Saint Johann) [10:46:48] (CR) Hashar: [V:+2] tests: Don't depend on Message implementation details [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082180 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:46:53] (CR) Hashar: [V:+2] Update for Message/MessageValue changes [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [10:48:00] (CR) Clément Goubert: [C:+2] sre.discovery.datacenter: Add failover_from action [cookbooks] - https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: Clément Goubert) [10:49:23] SRE, Infrastructure-Foundations, serviceops, Datacenter-Switchover, Patch-For-Review: sre.discovery.datacenter should support switching the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364#10249621 (Clement_Goubert) In progress→Resolved [10:52:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: post clone', diff saved to https://phabricator.wikimedia.org/P70470 and previous config saved to /var/cache/conftool/dbconfig/20241022-105238-arnaudb.json [10:53:15] SRE, SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10249636 (hnowlan) >>! In T377773#10249377, @gmodena wrote: >>>! In T377773#10249354, @hnowlan wrote: >> Could you please specify which groups access is needed to? There are a few dumps groups but it ap... 
[10:53:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P70471 and previous config saved to /var/cache/conftool/dbconfig/20241022-105329-ladsgroup.json [10:54:11] (Merged) jenkins-bot: sre.discovery.datacenter: Add failover_from action [cookbooks] - https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: Clément Goubert) [11:07:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: post clone', diff saved to https://phabricator.wikimedia.org/P70472 and previous config saved to /var/cache/conftool/dbconfig/20241022-110744-arnaudb.json [11:08:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P70473 and previous config saved to /var/cache/conftool/dbconfig/20241022-110836-ladsgroup.json [11:10:48] (PS1) Arturo Borrero Gonzalez: toolforge: apt_pinning: drop more unused pinning [puppet] - https://gerrit.wikimedia.org/r/1082188 [11:19:20] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 (MatthewVernon) NEW [11:19:32] (CR) Muehlenhoff: [C:+2] Drop the ircstream CNAME [dns] - https://gerrit.wikimedia.org/r/1082154 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff) [11:20:11] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10249725 (MatthewVernon) [11:21:03] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: sync [11:21:15] jouncebot: nowandnext [11:21:15] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [11:21:15] In 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1200) [11:21:19] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: sync [11:21:29] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1081980 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede) [11:23:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T376905)', diff saved to https://phabricator.wikimedia.org/P70474 and previous config saved to /var/cache/conftool/dbconfig/20241022-112343-ladsgroup.json [11:23:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:24:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:24:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T376905)', diff saved to https://phabricator.wikimedia.org/P70475 and previous config saved to /var/cache/conftool/dbconfig/20241022-112408-ladsgroup.json [11:24:18] I’d like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/1082180 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1082181 soon, if that’s okay with everyone [11:24:48] (but I’ll first wait a bit and see if hasharLunch comes back in time ^^ and in the meantime see if anyone objects) [11:26:03] (CR) Lucas Werkmeister (WMDE): "Okay, then can we make this clearer in the README? 
😇" [puppet] - https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup) [11:27:59] !log installing Java 11 security updates [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:35] (PS1) Jelto: gitlab::runner: stop runner on gitlab-runner2002 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082190 (https://phabricator.wikimedia.org/T377374) [11:29:55] (CR) Slyngshede: [V:+1 C:+2] Allow CAS to have Redis supported enabled in overlay. [puppet] - https://gerrit.wikimedia.org/r/1081980 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede) [11:32:50] (PS1) JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [11:33:20] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4331/co" [puppet] - https://gerrit.wikimedia.org/r/1082190 (https://phabricator.wikimedia.org/T377374) (owner: Jelto) [11:33:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T376905)', diff saved to https://phabricator.wikimedia.org/P70476 and previous config saved to /var/cache/conftool/dbconfig/20241022-113342-ladsgroup.json [11:34:01] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host kubestagemaster2005.codfw.wmnet [11:34:02] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) check for host kubestagemaster2005.codfw.wmnet [11:37:40] (PS1) KartikMistry: Update cxserver to 2024-10-22-112806-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082193 (https://phabricator.wikimedia.org/T357950) [11:39:36] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Java 11 security updates - 
klausman@cumin2002 [11:40:54] Doing quick cxserver deployment.. [11:41:26] (CR) KartikMistry: [C:+2] Update cxserver to 2024-10-22-112806-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082193 (https://phabricator.wikimedia.org/T357950) (owner: KartikMistry) [11:41:41] !log remove faidon from WMCS projects maps, visualeditor, swift, testlabs per his request. Keep the bastion project. cc paravoid [11:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:45] (Merged) jenkins-bot: Update cxserver to 2024-10-22-112806-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082193 (https://phabricator.wikimedia.org/T357950) (owner: KartikMistry) [11:43:05] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2085.codfw.wmnet [11:43:07] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) check for host wikikube-worker2085.codfw.wmnet [11:43:58] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:44:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2009.codfw.wmnet [11:44:21] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:44:40] kart_: is it okay if I do a mediawiki deployment in parallel? [11:45:20] should be fine. [11:45:26] alright, thanks! 
[11:45:33] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:45:34] then I’ll try to deploy those changes I mentioned above [11:46:00] (CR) Lucas Werkmeister (WMDE): [C:+2] "let’s see if I’m able to force-merge here to fix the broken CI cycle…" [extensions/WikibaseLexeme] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082180 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [11:46:01] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:46:23] (CR) Lucas Werkmeister (WMDE): [C:+2] Update for Message/MessageValue changes [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [11:46:29] (PS1) JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [11:46:47] (CR) Lucas Werkmeister (WMDE): [C:+2] "force-merging to fix broken CI (and then the actual deployment should proceed normally)" [extensions/WikibaseQualityConstraints] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082181 (https://phabricator.wikimedia.org/T377778) (owner: Lucas Werkmeister (WMDE)) [11:46:54] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:47:17] looks like scap backport is happy to run [11:47:28] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:47:31] Lucas_WMDE: I'm done as well :) [11:47:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1082180|tests: Don't depend on Message implementation details (T377778)]], [[gerrit:1082181|Update for Message/MessageValue changes (T377778)]] [11:47:37] yay :) [11:47:53] T377778: Wikibase CI tests are broken due to Wikimedia\Message\ScalarParam objects not the same in tests - 
https://phabricator.wikimedia.org/T377778 [11:48:10] !log Updated cxserver to 2024-10-22-112806-production (T357950) [11:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:18] T357950: Remove servicerunner dependency for cxserver - https://phabricator.wikimedia.org/T357950 [11:48:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P70477 and previous config saved to /var/cache/conftool/dbconfig/20241022-114849-ladsgroup.json [11:48:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:53:07] (PS3) JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [11:54:28] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:55:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1082180|tests: Don't depend on Message implementation details (T377778)]], [[gerrit:1082181|Update for Message/MessageValue changes (T377778)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:55:10] testing… [11:55:11] T377778: Wikibase CI tests are broken due to Wikimedia\Message\ScalarParam objects not the same in tests - https://phabricator.wikimedia.org/T377778 [11:55:32] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2085-2086,2088-2089].codfw.wmnet [11:55:44] okay, I can reproduce the bug, and now on mwdebug… [11:55:58] fixed, yay [11:56:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [11:57:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Java 11 security 
updates - klausman@cumin2002 [11:57:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2085-2086,2088-2089].codfw.wmnet [11:58:59] (PS4) JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1200) [12:00:15] (PS5) JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [12:00:44] (CR) JMeybohm: "I've not renamed the coobook (yet) as it breaks the diff." [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm) [12:01:28] (CR) JMeybohm: [C:+2] Migrate wikikube-worker208[5689] to containerd [puppet] - https://gerrit.wikimedia.org/r/1081910 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm) [12:01:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:02:31] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Java 11 security updates - klausman@cumin2002 [12:03:03] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082180|tests: Don't depend on Message implementation details (T377778)]], [[gerrit:1082181|Update for Message/MessageValue changes (T377778)]] (duration: 15m 27s) [12:03:16] T377778: Wikibase CI tests are broken due to Wikimedia\Message\ScalarParam objects not the same in tests - 
https://phabricator.wikimedia.org/T377778 [12:03:19] * Lucas_WMDE done deploying [12:03:22] sorry I ran a bit into the next window [12:03:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P70478 and previous config saved to /var/cache/conftool/dbconfig/20241022-120356-ladsgroup.json [12:05:12] !log Running MediaModeration scan on all group0 wikis [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:16] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2085.codfw.wmnet with OS bookworm [12:06:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2086.codfw.wmnet with OS bookworm [12:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2149 and db2227 - T377718', diff saved to https://phabricator.wikimedia.org/P70479 and previous config saved to /var/cache/conftool/dbconfig/20241022-120753-arnaudb.json [12:07:57] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [12:08:18] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2088.codfw.wmnet with OS bookworm [12:09:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db[2149,2227].codfw.wmnet with reason: maintenance [12:09:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" 
[12:09:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2009.codfw.wmnet [12:09:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2149,2227].codfw.wmnet with reason: maintenance [12:09:23] SRE, decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10249950 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2009.codfw.wmnet` - ganeti2009.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... [12:09:58] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2089.codfw.wmnet with OS bookworm [12:12:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2010.codfw.wmnet [12:12:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 5%: T377718', diff saved to https://phabricator.wikimedia.org/P70480 and previous config saved to /var/cache/conftool/dbconfig/20241022-121218-arnaudb.json [12:12:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2227.codfw.wmnet [12:14:00] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10249964 (MoritzMuehlenhoff) [12:15:10] (PS1) Slyngshede: P:idp Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/1082198 (https://phabricator.wikimedia.org/T367065) [12:17:14] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4332/console" [puppet] - https://gerrit.wikimedia.org/r/1082198 (https://phabricator.wikimedia.org/T367065) (owner: Slyngshede) [12:17:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:19:03] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T376905)', diff saved to https://phabricator.wikimedia.org/P70481 and previous config saved to /var/cache/conftool/dbconfig/20241022-121903-ladsgroup.json [12:19:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [12:19:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [12:19:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70482 and previous config saved to /var/cache/conftool/dbconfig/20241022-121928-ladsgroup.json [12:20:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Java 11 security updates - klausman@cumin2002 [12:20:37] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4333/console" [puppet] - https://gerrit.wikimedia.org/r/1082198 (https://phabricator.wikimedia.org/T367065) (owner: Slyngshede) [12:20:45] !log Running MediaModeration scan on all group1 wikis [12:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:58] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:24:17] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2088.codfw.wmnet with reason: host reimage [12:24:35] (CR) Jelto: [V:+1 C:+2] gitlab::runner: stop runner on gitlab-runner2002 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082190 (https://phabricator.wikimedia.org/T377374) (owner: Jelto) 
[12:27:01] !log Stopped MediaModeration scan on all group1 wikis [12:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: T377718', diff saved to https://phabricator.wikimedia.org/P70483 and previous config saved to /var/cache/conftool/dbconfig/20241022-122723-arnaudb.json [12:27:25] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2085.codfw.wmnet with reason: host reimage [12:27:26] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2086.codfw.wmnet with reason: host reimage [12:27:30] !log Running MediaModeration scan on all group2 wikis [12:27:32] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [12:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2088.codfw.wmnet with reason: host reimage [12:28:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:28:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2010.codfw.wmnet [12:28:39] SRE, decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10250000 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2010.codfw.wmnet` - ganeti2010.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... 
[12:28:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70484 and previous config saved to /var/cache/conftool/dbconfig/20241022-122854-ladsgroup.json [12:31:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2085.codfw.wmnet with reason: host reimage [12:32:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:34:12] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2089.codfw.wmnet with reason: host reimage [12:34:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2086.codfw.wmnet with reason: host reimage [12:34:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:34:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:34:46] (PS11) Arnaudb: mysql_legacy: get systemd status for instance [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) [12:34:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:35:22] (Abandoned) Slyngshede: P:idp Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/1082198 (https://phabricator.wikimedia.org/T367065) (owner: Slyngshede) [12:36:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bullseye [12:36:49] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host gitlab-runner2002 [12:37:44] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:37:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2089.codfw.wmnet with 
reason: host reimage [12:38:34] SRE, decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10250021 (MoritzMuehlenhoff) [12:39:27] (CR) Arnaudb: [C:+1] sre.mysql.pool: add two new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: Volans) [12:40:02] ops-codfw, SRE, DC-Ops, decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10250023 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None [12:41:34] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2002 - jelto@cumin1002" [12:41:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2002 - jelto@cumin1002" [12:41:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache gitlab-runner2002.codfw.wmnet 161.16.192.10.in-addr.arpa 1.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:41:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner2002.codfw.wmnet 161.16.192.10.in-addr.arpa 1.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:41:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host gitlab-runner2002 [12:42:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gitlab-runner2002 [12:42:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host gitlab-runner2002 [12:42:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: T377718', 
diff saved to https://phabricator.wikimedia.org/P70485 and previous config saved to /var/cache/conftool/dbconfig/20241022-124228-arnaudb.json [12:42:36] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [12:42:50] (03CR) 10Slyngshede: "I question the value of this check. It only tests that an Apache instance with mod_auth_cas redirects to idp/idp-test. It says nothing abo" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [12:43:51] (03PS1) 10Arturo Borrero Gonzalez: wmcs: puppetserver: introduce apt pin for openjdk [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) [12:44:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P70486 and previous config saved to /var/cache/conftool/dbconfig/20241022-124401-ladsgroup.json [12:44:45] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) (owner: 10Arturo Borrero Gonzalez) [12:45:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2088.codfw.wmnet with OS bookworm [12:49:28] (03PS2) 10Arturo Borrero Gonzalez: wmcs: puppetserver: introduce apt pin for openjdk [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) [12:50:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2085.codfw.wmnet with OS bookworm [12:51:10] 06SRE, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#10250056 (10Diskdance) FWIW, Cloudflare has [[ https://github.com/net4people/bbs/issues/393 | enabled ECH by default ]]. 
[12:52:04] 06SRE, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#10250057 (10Diskdance) [12:52:50] 06SRE, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#10250058 (10Diskdance) [12:53:23] (03PS1) 10Dreamy Jazz: Don't escape performer link HTML in GlobalBlockDetailsRenderer [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082202 (https://phabricator.wikimedia.org/T377398) [12:53:30] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2086.codfw.wmnet with OS bookworm [12:53:43] jouncebot: nowandnext [12:53:43] For the next 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1200) [12:53:43] In 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1300) [12:54:03] Will add my item to the window [12:54:20] !log aqu@deploy2002 Started deploy [analytics/refinery@ffc985a]: Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] [12:54:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082202 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [12:55:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2089.codfw.wmnet with OS bookworm [12:56:26] (03CR) 10Muehlenhoff: [C:03+1] "I'm not familiar with unattended-upgrades, but the apt pin looks good and this might work. 
Two typos inline" [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) (owner: 10Arturo Borrero Gonzalez) [12:57:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: T377718', diff saved to https://phabricator.wikimedia.org/P70487 and previous config saved to /var/cache/conftool/dbconfig/20241022-125734-arnaudb.json [12:57:39] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [12:58:04] (03PS2) 10Arturo Borrero Gonzalez: toolforge: apt_pinning: drop more unused pinning [puppet] - 10https://gerrit.wikimedia.org/r/1082188 [12:58:04] (03PS3) 10Arturo Borrero Gonzalez: wmcs: puppetserver: introduce apt pin for openjdk [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) [12:58:39] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [12:59:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P70488 and previous config saved to /var/cache/conftool/dbconfig/20241022-125908-ladsgroup.json [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1300). [13:00:05] HouseOfM, joelyrookewmde, and Dreamy Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:25] hi, I'm here ! [13:01:23] o/ [13:01:36] I've been here the whole time XD [13:01:42] \o [13:02:05] o/ [13:02:08] I can deploy! [13:02:19] huzzah [13:02:37] Great. Currently in a meeting, so wouldn't mind just doing the testing. 
[13:02:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [13:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786) (owner: 10Mhorsey) [13:04:30] (03Merged) 10jenkins-bot: Release CampaignEvents to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786) (owner: 10Mhorsey) [13:04:33] (03CR) 10FNegri: [C:03+1] "I don't know the context but I trust your memory, and it's great if we can remove those pins." [puppet] - 10https://gerrit.wikimedia.org/r/1082188 (owner: 10Arturo Borrero Gonzalez) [13:04:58] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078907|Release CampaignEvents to eswiki (T376786)]] [13:05:03] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: apt_pinning: drop more unused pinning [puppet] - 10https://gerrit.wikimedia.org/r/1082188 (owner: 10Arturo Borrero Gonzalez) [13:05:07] T376786: Release CampaignEvents extension to Spanish Wikipedia [Oct 22, 14:00 UTC] - https://phabricator.wikimedia.org/T376786 [13:07:33] !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1078907|Release CampaignEvents to eswiki (T376786)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082202 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [13:07:40] (03PS1) 10STran: Support template overrides in ContributionsPager [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) [13:07:47] (03CR) 10Muehlenhoff: 
[C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) (owner: 10Arturo Borrero Gonzalez) [13:07:49] HouseOfM: can you test the eswiki change? [13:08:30] testing [13:09:54] LGTM :) [13:09:57] !log lucaswerkmeister-wmde@deploy2002 mhorsey, lucaswerkmeister-wmde: Continuing with sync [13:09:59] ok :) [13:10:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [13:12:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: T377718', diff saved to https://phabricator.wikimedia.org/P70489 and previous config saved to /var/cache/conftool/dbconfig/20241022-131239-arnaudb.json [13:12:45] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [13:13:24] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10250114 (10MatthewVernon) There are a few nodes with getting-full drives: ` ms-be1063 sdb3 84% ms-be1065 sda3 88% ms-be1066 sdb3 (91%)... 
[13:14:01] !log aqu@deploy2002 Finished deploy [analytics/refinery@ffc985a]: Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] (duration: 19m 41s) [13:14:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70490 and previous config saved to /var/cache/conftool/dbconfig/20241022-131415-ladsgroup.json [13:14:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [13:14:33] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078907|Release CampaignEvents to eswiki (T376786)]] (duration: 09m 35s) [13:14:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [13:14:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:14:37] T376786: Release CampaignEvents extension to Spanish Wikipedia [Oct 22, 14:00 UTC] - https://phabricator.wikimedia.org/T376786 [13:14:39] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10250115 (10MatthewVernon) ...thought we're still left with the question of why we have some distinctly over-loaded drives. 
[13:14:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:14:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T376905)', diff saved to https://phabricator.wikimedia.org/P70491 and previous config saved to /var/cache/conftool/dbconfig/20241022-131448-ladsgroup.json [13:15:10] zuul says ETA 4 minutes on GlobalBlocking gate-and-submit, let’s do that before the other config change then [13:15:18] so it doesn’t merge during the deployment [13:15:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082202 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [13:16:15] (03Merged) 10jenkins-bot: Don't escape performer link HTML in GlobalBlockDetailsRenderer [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082202 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [13:16:22] (03PS2) 10Ladsgroup: tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) [13:16:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1082202|Don't escape performer link HTML in GlobalBlockDetailsRenderer (T377398)]] [13:16:50] (03PS6) 10Vgutierrez: liberica: provide a liberica module [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) [13:16:50] T377398: Special:Contributions global block notice displays incorrectly - https://phabricator.wikimedia.org/T377398 [13:16:50] (03PS2) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) [13:17:17] 
(03CR) 10Ladsgroup: "Updated it. Let me know if it's not good." [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:17:28] (03CR) 10Vgutierrez: [C:03+1] eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [13:18:14] (03CR) 10CDanis: [C:03+2] eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [13:18:29] (03CR) 10CDanis: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [13:19:05] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2085-2086,2088-2089].codfw.wmnet [13:19:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamyjazz: Backport for [[gerrit:1082202|Don't escape performer link HTML in GlobalBlockDetailsRenderer (T377398)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:19:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2085-2086,2088-2089].codfw.wmnet [13:19:21] Testing. [13:19:29] (03CR) 10CI reject: [V:04-1] profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [13:19:42] thanks [13:21:27] The fix worked, but broke something else. [13:21:34] hm :/ [13:21:36] So I don't think we should proceed [13:21:53] ok… so cancel the scap and then scap backport --revert? [13:22:07] (unless you want to upload a revert yourself) [13:22:08] Yeah. I think so. 
[13:22:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10250152 (10jcrespo) Thanks, I can start provisioning it, however, there seems to be an issue with the disk monitoring. Our puppet installation see... [13:22:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [13:22:41] I will be making a fix to fix the second problem now, so not sure what the best plan is. [13:22:41] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10250153 (10MatthewVernon) comparing two nodes, ms-be1066: ` root@ms-be1066:~# du -sh /srv/swift-storage/sdb3/containers/* | grep -c G... [13:22:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [13:23:18] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:24:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T376905)', diff saved to https://phabricator.wikimedia.org/P70492 and previous config saved to /var/cache/conftool/dbconfig/20241022-132409-ladsgroup.json [13:24:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [13:24:53] wmf.28 is only on test wikis at the moment, right? [13:25:02] so maybe I could sync the change after all, and then continue with the config change [13:25:09] and afterwards either deploy your fix or revert it after all [13:25:18] Yeah, it's only on test wiki [13:25:20] but it would be okay if the broken version is on test wikis for half an hour or so [13:25:25] (i assume) [13:25:28] It's not broken in a way which is bad [13:25:34] Just a link which doesn't work. 
[13:25:45] ok, then let’s do that [13:25:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamyjazz: Continuing with sync [13:27:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: T377718', diff saved to https://phabricator.wikimedia.org/P70493 and previous config saved to /var/cache/conftool/dbconfig/20241022-132745-arnaudb.json [13:27:50] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [13:29:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:29:50] FIRING: KubernetesCalicoDown: ml-serve2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2011.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:30:23] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:30:24] (03PS1) 10Jelto: Revert "gitlab::runner: stop runner on gitlab-runner2002 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1082205 [13:30:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:30:54] (03PS2) 10Jelto: Revert "gitlab::runner: stop runner on gitlab-runner2002 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1082205 (https://phabricator.wikimedia.org/T377374) [13:31:02] (03CR) 10Lucas Werkmeister (WMDE): tables-catalog: Remodel databases for non-default or non-core tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:31:12] (03CR) 10Lucas Werkmeister (WMDE): "README looks good to me now, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:31:45] (03CR) 10Jelto: [C:03+2] Revert "gitlab::runner: stop runner on gitlab-runner2002 for reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1082205 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [13:32:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10250189 (10Papaul) @elukey thank you for putting this together . There is one thing i am truing to understand here " Wait for the update and BMC reset A... [13:32:12] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082202|Don't escape performer link HTML in GlobalBlockDetailsRenderer (T377398)]] (duration: 15m 27s) [13:32:17] T377398: Special:Contributions global block notice displays incorrectly - https://phabricator.wikimedia.org/T377398 [13:32:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [13:33:44] ok, let’s continue with joelyrookewmde then! 
[13:33:56] great :) [13:34:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081995 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:34:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2149.codfw.wmnet onto db2227.codfw.wmnet [13:34:50] RESOLVED: KubernetesCalicoDown: ml-serve2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2011.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:35:01] (03Merged) 10jenkins-bot: Activate feature flag to default move wikibase sidebar link to other projects. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081995 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:35:25] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1081995|Activate feature flag to default move wikibase sidebar link to other projects. (T66315)]] [13:35:37] T66315: Move "Data item" link into In Other Projects section of sidebar - https://phabricator.wikimedia.org/T66315 [13:36:29] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:33] (03PS1) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [13:37:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10250234 (10elukey) >>! 
In T371400#10250189, @Papaul wrote: > @elukey thank you for putting this together . There is one thing i am truing to understand... [13:37:50] !log lucaswerkmeister-wmde@deploy2002 joelyrookewmde, lucaswerkmeister-wmde: Backport for [[gerrit:1081995|Activate feature flag to default move wikibase sidebar link to other projects. (T66315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:38:00] joelyrookewmde: please test on mwdebug :) [13:38:03] wilco [13:38:09] (03PS1) 10Jelto: docker_registry_ha::registry: update gitlab-runner2002 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082208 (https://phabricator.wikimedia.org/T377374) [13:38:45] (03PS1) 10Slyngshede: P:idp ensure defaults for Redis is present for all deployment. [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) [13:39:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:39:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P70494 and previous config saved to /var/cache/conftool/dbconfig/20241022-133916-ladsgroup.json [13:39:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2002.codfw.wmnet with OS bullseye [13:39:23] I'm happy with that [13:39:29] ok! 
[13:39:31] !log lucaswerkmeister-wmde@deploy2002 joelyrookewmde, lucaswerkmeister-wmde: Continuing with sync [13:39:39] (03PS3) 10Ladsgroup: tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) [13:39:51] (03CR) 10Ladsgroup: tables-catalog: Remodel databases for non-default or non-core tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:41:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 5%: T377718', diff saved to https://phabricator.wikimedia.org/P70495 and previous config saved to /var/cache/conftool/dbconfig/20241022-134126-arnaudb.json [13:41:45] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [13:41:58] !log aqu@deploy2002 Started deploy [analytics/refinery@ffc985a] (thin): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] [13:42:22] (03PS3) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) [13:43:24] (03PS1) 10Brouberol: airflow: define useful fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082211 [13:44:06] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081995|Activate feature flag to default move wikibase sidebar link to other projects. 
(T66315)]] (duration: 08m 40s) [13:44:08] (03CR) 10Vgutierrez: profile: Provide a liberica profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [13:44:37] T66315: Move "Data item" link into In Other Projects section of sidebar - https://phabricator.wikimedia.org/T66315 [13:45:02] (03PS2) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [13:45:20] FIRING: [2x] KubernetesCalicoDown: ml-serve2010.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:45:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:45:48] !log aqu@deploy2002 deploy aborted: Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] (duration: 03m 50s) [13:45:58] !log aqu@deploy2002 Started deploy [analytics/refinery@ffc985a] (thin): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] [13:46:13] (03PS24) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [13:46:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [13:46:37] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:46:44] (03PS2) 10Brouberol: airflow: define useful fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082211 [13:46:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:46:56] !log aqu@deploy2002 Finished deploy [analytics/refinery@ffc985a] (thin): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] (duration: 00m 57s) [13:47:11] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:47:23] !log aqu@deploy2002 Started deploy [analytics/refinery@ffc985a] (thin): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] [13:47:30] !log aqu@deploy2002 Finished deploy [analytics/refinery@ffc985a] (thin): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] (duration: 00m 07s) [13:48:05] !log aqu@deploy2002 Started deploy [analytics/refinery@ffc985a] (hadoop-test): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] [13:48:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet [13:48:34] (03PS1) 10Slyngshede: P:ircstream close IRC connection nicely after probe [puppet] - 10https://gerrit.wikimedia.org/r/1082212 [13:49:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [13:49:57] (03PS4) 10Ladsgroup: tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) [13:50:05] RESOLVED: [2x] KubernetesCalicoDown: ml-serve2010.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:50:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) 
[13:50:10] jouncebot: next [13:50:10] In 1 hour(s) and 9 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1500) [13:50:36] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10250269 (10xcollazo) dumpsdata* and snapshot* are the required hosts for `dumpsgen`. @gmodena can you try: ` ssh snapshot1014.eqiad.wmnet sudo -l ` And list here the output of `sudo -l`? [13:51:11] Dreamy_Jazz: do you think the fix will be ready soon, or should I revert for now and then you can deploy it later? [13:51:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70496 and previous config saved to /var/cache/conftool/dbconfig/20241022-135112-arnaudb.json [13:51:22] !log aqu@deploy2002 Finished deploy [analytics/refinery@ffc985a] (hadoop-test): Adding refinery/source 0.2.49.2 & 0.2.53 [analytics/refinery@ffc985a7] (duration: 03m 17s) [13:51:34] I think you can leave my change in. I am currently writing the second fix and should push it shortly. [13:51:41] ok [13:51:43] I can then self-deploy it outside the window. 
[13:51:51] alright [13:52:14] !log UTC afternoon backport+window done (a further GlobalBlocking fix will be backported out-of-window soon) [13:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P70497 and previous config saved to /var/cache/conftool/dbconfig/20241022-135424-ladsgroup.json [13:56:08] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10250298 (10elukey) [13:56:15] (03PS1) 10Slyngshede: P:ircstream temporarily disable alerting [puppet] - 10https://gerrit.wikimedia.org/r/1082214 [13:56:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 10%: T377718', diff saved to https://phabricator.wikimedia.org/P70498 and previous config saved to /var/cache/conftool/dbconfig/20241022-135631-arnaudb.json [13:56:48] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [13:57:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [13:57:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet [13:59:13] !log rebalance ganeti clusters in magru following reboots [13:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:47] (03CR) 10Herron: "Thanks for this! 
LGTM overall, although I think worth a run of the PCC to check for unexpected changes" [puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: 10Tiziano Fogli) [14:01:29] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082214 (owner: 10Slyngshede) [14:02:33] (03CR) 10Slyngshede: [C:03+2] P:ircstream temporarily disable alerting [puppet] - 10https://gerrit.wikimedia.org/r/1082214 (owner: 10Slyngshede) [14:03:38] (03CR) 10Bking: [C:03+1] airflow: define useful fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082211 (owner: 10Brouberol) [14:03:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [14:06:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: post clone', diff saved to https://phabricator.wikimedia.org/P70499 and previous config saved to /var/cache/conftool/dbconfig/20241022-140617-arnaudb.json [14:07:37] (03CR) 10Brouberol: [C:03+2] airflow: define useful fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082211 (owner: 10Brouberol) [14:07:48] (03PS3) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [14:08:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [14:09:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T376905)', diff saved to https://phabricator.wikimedia.org/P70500 and previous config saved to 
/var/cache/conftool/dbconfig/20241022-140931-ladsgroup.json [14:09:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [14:09:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [14:09:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T376905)', diff saved to https://phabricator.wikimedia.org/P70501 and previous config saved to /var/cache/conftool/dbconfig/20241022-140956-ladsgroup.json [14:10:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [14:10:24] (03PS1) 10Aqu: [analytics][refine] Use deduplication fix backport in legacy Refine job [puppet] - 10https://gerrit.wikimedia.org/r/1082218 (https://phabricator.wikimedia.org/T369845) [14:10:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2011.codfw.wmnet [14:10:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [14:11:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: T377718', diff saved to https://phabricator.wikimedia.org/P70502 and previous config saved to /var/cache/conftool/dbconfig/20241022-141137-arnaudb.json [14:11:42] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [14:11:42] (03CR) 10Btullis: statistics::explorer hosts: refactor and improve cgroups implementation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [14:12:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10250385 (10ops-monitoring-bot) Draining ganeti2011.codfw.wmnet of running VMs [14:12:59] (03PS2) 
10Slyngshede: P:idp ensure defaults for Redis is present for all deployment. [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) [14:13:03] (03CR) 10Bking: airflow: define an optional airflow-kerberos Deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:13:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1076665 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:13:46] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:13:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076666 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:15:47] (03PS1) 10Dreamy Jazz: Fix performer link on Special:GlobalBlockList [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082219 (https://phabricator.wikimedia.org/T377398) [14:15:55] (03CR) 10Dreamy Jazz: [C:03+2] Fix performer link on Special:GlobalBlockList [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082219 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [14:16:00] jouncebot: nowandnext [14:16:00] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [14:16:00] In 0 hour(s) and 43 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1500) [14:16:09] Going to deploy the follow-up fix now [14:16:14] \o/ [14:16:38] (03CR) 10Brouberol: airflow: define an optional airflow-kerberos Deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:18:30] (03PS3) 10Tiziano Fogli: prometheus/cadvisor: lookup extra 
metrics from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) [14:18:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T376905)', diff saved to https://phabricator.wikimedia.org/P70503 and previous config saved to /var/cache/conftool/dbconfig/20241022-141848-ladsgroup.json [14:20:14] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: 10Tiziano Fogli) [14:21:01] (03CR) 10Bking: [C:03+1] airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:21:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: post clone', diff saved to https://phabricator.wikimedia.org/P70504 and previous config saved to /var/cache/conftool/dbconfig/20241022-142123-arnaudb.json [14:22:10] (03PS3) 10Arturo Borrero Gonzalez: P:idp ensure defaults for Redis is present for all deployment. 
[puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:22:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082219 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [14:22:13] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:24:24] (03Merged) 10jenkins-bot: Fix performer link on Special:GlobalBlockList [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082219 (https://phabricator.wikimedia.org/T377398) (owner: 10Dreamy Jazz) [14:24:51] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1082219|Fix performer link on Special:GlobalBlockList (T377398)]] [14:24:55] T377398: Special:Contributions global block notice displays incorrectly - https://phabricator.wikimedia.org/T377398 [14:25:21] (03CR) 10Btullis: airflow: define an optional airflow-kerberos Deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:25:44] (03PS4) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [14:26:31] (03CR) 10CI reject: [V:04-1] airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:26:37] (03CR) 10Brouberol: airflow: define an optional airflow-kerberos Deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:26:43] !log arnaudb@cumin1002 dbctl 
commit (dc=all): 'db2227 (re)pooling @ 50%: T377718', diff saved to https://phabricator.wikimedia.org/P70505 and previous config saved to /var/cache/conftool/dbconfig/20241022-142642-arnaudb.json [14:26:50] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [14:26:59] (03CR) 10Muehlenhoff: "This should go into cloud.yaml, we already have all the other cloud defaults there." [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:27:26] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1082219|Fix performer link on Special:GlobalBlockList (T377398)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:27:52] It works this time :D [14:27:54] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:28:43] (03PS5) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [14:29:33] (03CR) 10Tiziano Fogli: "Thank you @kherron@wikimedia.org." 
[puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: 10Tiziano Fogli) [14:29:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:29:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:29:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:29:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:30:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70506 and previous config saved to /var/cache/conftool/dbconfig/20241022-143005-arnaudb.json [14:30:06] (03CR) 10CI reject: [V:04-1] airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:30:19] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:30:40] (03PS6) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [14:31:05] (03PS4) 10Slyngshede: P:idp ensure defaults for Redis is present for all deployment. 
[puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) [14:31:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:31:08] (03PS7) 10Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) [14:32:34] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082219|Fix performer link on Special:GlobalBlockList (T377398)]] (duration: 07m 43s) [14:32:34] jouncebot: now [14:32:34] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [14:32:39] T377398: Special:Contributions global block notice displays incorrectly - https://phabricator.wikimedia.org/T377398 [14:32:47] wondering if it’s worth backporting the fix for T377533 [14:32:47] T377533: Recent changes doesn't have space after target title and before the reason part - https://phabricator.wikimedia.org/T377533 [14:32:55] it’s not a serious issue but a lot of people seem to be noticing it ^^ [14:33:48] Seems small enough that a backport wouldn't be risky [14:33:50] (03PS5) 10Slyngshede: P:idp ensure defaults for Redis is present for all deployment. [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) [14:33:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P70507 and previous config saved to /var/cache/conftool/dbconfig/20241022-143355-ladsgroup.json [14:34:07] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [14:34:11] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [14:34:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. 
on all recursors [14:34:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:35:09] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:36:08] (03CR) 10Volans: "Just make sure to follow https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook once merged ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:36:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: post clone', diff saved to https://phabricator.wikimedia.org/P70509 and previous config saved to /var/cache/conftool/dbconfig/20241022-143628-arnaudb.json [14:36:36] on the other hand, core CI takes a while :/ [14:36:46] I think I’ve talked myself out of it, meh ^^ [14:36:54] rolling out with the train should be okay [14:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:16] Yeah, it's not really important to fix [14:37:25] If it's still broken, then so be it. 
[14:37:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2084 to codfw - jhancock@cumin2002" [14:37:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2084 to codfw - jhancock@cumin2002" [14:37:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:12] (03CR) 10Jelto: [C:03+2] profile::firewall: separate ipv4 and ipv6 in nftables BLOCKED_NETS [puppet] - 10https://gerrit.wikimedia.org/r/1076665 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:38:15] (03CR) 10Jelto: [V:03+1 C:03+2] sretest: test defs_from_etcd with new separate sets [puppet] - 10https://gerrit.wikimedia.org/r/1076666 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:40:26] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [14:40:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [14:41:34] (03CR) 10Brouberol: airflow: define an optional airflow-kerberos Deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [14:41:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: T377718', diff saved to https://phabricator.wikimedia.org/P70510 and previous config saved to /var/cache/conftool/dbconfig/20241022-144148-arnaudb.json [14:41:53] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [14:42:12] (03PS21) 10Ebomani: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) [14:42:12] (03CR) 10Ebomani: "Hi Antoine, thank you so much for the detailed comments, great catch on the legacy bit too! I agree with your thoughts on combining the re" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:42:54] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853 (10jcrespo) 03NEW [14:45:48] (03PS2) 10Hnowlan: sessionstore: use service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) [14:45:51] (03CR) 10Brouberol: "Looks good! Can you bump the chart version?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745) (owner: 10Gmodena) [14:46:10] (03CR) 10Hnowlan: "Oh, good catch! 
Thank you, done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [14:46:16] (03CR) 10Jelto: [V:03+1 C:03+2] "This is not working, `etc/nftables/sets/requestctl.nft` is empty because confd throws the following error:" [puppet] - 10https://gerrit.wikimedia.org/r/1076666 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:46:56] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10250613 (10jcrespo) [14:47:12] (03PS1) 10Jelto: Revert "sretest: test defs_from_etcd with new separate sets" [puppet] - 10https://gerrit.wikimedia.org/r/1082224 (https://phabricator.wikimedia.org/T348734) [14:47:16] (03PS1) 10Jelto: Revert "profile::firewall: separate ipv4 and ipv6 in nftables BL..." [puppet] - 10https://gerrit.wikimedia.org/r/1082225 (https://phabricator.wikimedia.org/T348734) [14:48:08] (03CR) 10Hashar: [C:03+2] Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:48:48] (03Merged) 10jenkins-bot: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:49:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P70511 and previous config saved to /var/cache/conftool/dbconfig/20241022-144902-ladsgroup.json [14:49:21] (03PS1) 10Majavah: P:wmcs::nfs::standalone: Fix link [puppet] - 10https://gerrit.wikimedia.org/r/1082226 [14:49:27] (03CR) 10Jelto: [C:03+2] Revert "profile::firewall: separate ipv4 and ipv6 in nftables BL..." [puppet] - 10https://gerrit.wikimedia.org/r/1082225 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:49:30] (03CR) 10Jelto: [C:03+2] Revert "sretest: test defs_from_etcd with new separate sets" [puppet] - 10https://gerrit.wikimedia.org/r/1082224 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [14:50:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [14:51:21] (03CR) 10CI reject: [V:04-1] P:wmcs::nfs::standalone: Fix link [puppet] - 10https://gerrit.wikimedia.org/r/1082226 (owner: 10Majavah) [14:52:06] !log hashar@deploy2002 Started deploy [gerrit/gerrit@30691f2]: Update patch demo to recognize both legacy and new URLs - T374954 [14:52:11] T374954: Update gerrit plugin for PatchDemo - https://phabricator.wikimedia.org/T374954 [14:52:16] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@30691f2]: Update patch demo to recognize both legacy and new URLs - T374954 (duration: 00m 10s) [14:52:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10250655 (10Jgreen) There are 6 servers being replaced: {T369565} {T369947} {T369947} Plus 3 new servers: {T367820} [14:52:46] (03PS2) 10Majavah: P:wmcs::nfs::standalone: Fix link [puppet] - 
10https://gerrit.wikimedia.org/r/1082226 [14:52:46] (03PS1) 10Majavah: P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 [14:53:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:53:56] (03CR) 10Scott French: [C:03+1] "Great, that should do it!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [14:55:18] (03CR) 10CI reject: [V:04-1] P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 (owner: 10Majavah) [14:56:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: T377718', diff saved to https://phabricator.wikimedia.org/P70512 and previous config saved to /var/cache/conftool/dbconfig/20241022-145653-arnaudb.json [14:57:02] T377718: db2205 and db2227 need to be recloned from 10.6.17 - https://phabricator.wikimedia.org/T377718 [14:57:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:58:04] (03PS2) 10Majavah: P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 [14:59:05] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082229 [15:00:05] eoghan, jelto, arnoldokoth, and mutante: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1500). 
[15:01:19] (03CR) 10CI reject: [V:04-1] P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 (owner: 10Majavah) [15:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10250732 (10Jelto) Unfortunately the last attempt of separating ipv4 and ipv6 also failed. I tried to fill two sets `BLOCKED_NETS_ipv4` and `B... [15:03:37] (03CR) 10Brouberol: airflow: define an optional airflow-kerberos Deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [15:04:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T376905)', diff saved to https://phabricator.wikimedia.org/P70513 and previous config saved to /var/cache/conftool/dbconfig/20241022-150409-ladsgroup.json [15:04:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10250730 (10jcrespo) I filed T377853 with a possible fix. 
[15:04:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:04:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:04:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70514 and previous config saved to /var/cache/conftool/dbconfig/20241022-150435-ladsgroup.json [15:04:35] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081950 (owner: 10PipelineBot) [15:04:39] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081578 (owner: 10PipelineBot) [15:04:42] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081128 (owner: 10PipelineBot) [15:06:31] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deployment [15:06:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:06:45] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deployment [15:06:51] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phabricator.wikimedia.org with reason: Phabricator deployment [15:06:52] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: Phabricator deployment [15:07:09] !log brennen@deploy2002 Started deploy [phabricator/deployment@582cde5]: test deploy phab2002 for T377850 (may fail, expected) [15:07:14] T377850: Deploy Phabricator/Phorge 2024-10-22 - https://phabricator.wikimedia.org/T377850 
[15:07:21] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator deployment [15:07:22] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator deployment [15:07:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:07:34] !log brennen@deploy2002 Finished deploy [phabricator/deployment@582cde5]: test deploy phab2002 for T377850 (may fail, expected) (duration: 00m 24s) [15:08:05] !log brennen@deploy2002 Started deploy [phabricator/deployment@582cde5]: deploy phab1004 for T377850 [15:08:19] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082229 (owner: 10PipelineBot) [15:08:59] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082230 (https://phabricator.wikimedia.org/T373861) [15:09:09] !log brennen@deploy2002 Finished deploy [phabricator/deployment@582cde5]: deploy phab1004 for T377850 (duration: 01m 04s) [15:09:23] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082229 (owner: 10PipelineBot) [15:10:23] (03PS1) 10Arturo Borrero Gonzalez: toolforge: apt_pinning: fix repeated entry [puppet] - 10https://gerrit.wikimedia.org/r/1082233 [15:10:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082233 (owner: 10Arturo Borrero Gonzalez) [15:10:43] !log gmodena@deploy2002 Started deploy [airflow-dags/analytics@7c2d65f]: DPE 2024-10-22 deployment train [15:11:08] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - 
https://phabricator.wikimedia.org/T377853#10250760 (10jcrespo) After testing on older hosts, storecli seems to work on older hosts from a diff... [15:11:17] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082234 (https://phabricator.wikimedia.org/T373861) [15:11:23] !log gmodena@deploy2002 Finished deploy [airflow-dags/analytics@7c2d65f]: DPE 2024-10-22 deployment train (duration: 01m 16s) [15:12:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70515 and previous config saved to /var/cache/conftool/dbconfig/20241022-151237-ladsgroup.json [15:12:40] (03CR) 10Herron: [C:03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: 10Tiziano Fogli) [15:13:15] (03PS1) 10CDanis: haproxy: gpc_rate arrays to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) [15:13:21] (03PS2) 10Arturo Borrero Gonzalez: toolforge: apt_pinning: fix repeated entry [puppet] - 10https://gerrit.wikimedia.org/r/1082233 [15:13:23] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [15:14:14] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host kubestagemaster2003.codfw.wmnet [15:14:15] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) check for host kubestagemaster2003.codfw.wmnet [15:14:17] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082233 (owner: 10Arturo Borrero Gonzalez) [15:14:18] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [15:14:41] (03PS2) 
10CDanis: haproxy: gpc_rate arrays to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) [15:14:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [15:15:01] !log Deployed refinery using scap, then deployed onto hdfs [15:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] (03CR) 10Btullis: [C:03+1] "Except for 1 nit, all good to go." [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:16:10] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: apt_pinning: fix repeated entry [puppet] - 10https://gerrit.wikimedia.org/r/1082233 (owner: 10Arturo Borrero Gonzalez) [15:16:15] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082230 (https://phabricator.wikimedia.org/T373861) (owner: 10Clare Ming) [15:16:16] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082234 (https://phabricator.wikimedia.org/T373861) (owner: 10Clare Ming) [15:16:50] (03CR) 10CDanis: [C:04-2] "pcc diff looks good to go but will wait until Thursday to roll out past eqsin, as discussed" [puppet] - 10https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [15:17:14] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082230 (https://phabricator.wikimedia.org/T373861) (owner: 10Clare Ming) [15:17:17] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082234 (https://phabricator.wikimedia.org/T373861) 
(owner: 10Clare Ming) [15:18:21] (03PS1) 10Mvolz: Update Zotero to node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082237 [15:18:24] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:18:57] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:19:05] (03PS3) 10Majavah: pybal: Use `is` instead of `==` for comparing Python types [puppet] - 10https://gerrit.wikimedia.org/r/1076420 [15:19:05] (03PS3) 10Majavah: P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 [15:19:05] (03PS4) 10Majavah: P:wmcs::nfs::standalone: Fix link [puppet] - 10https://gerrit.wikimedia.org/r/1082226 [15:19:07] (03PS1) 10Vgutierrez: service: Set depool_threshold as a float [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) [15:19:18] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:19:37] (03PS2) 10Vgutierrez: wmflib::service: Set depool_threshold as a float [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) [15:19:51] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:20:14] (03PS1) 10Majavah: openstack: Remove OATHAuth 2FA (wmtotp) support [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) [15:23:18] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10250885 (10gmodena) >>! In T377773#10250269, @xcollazo wrote: > dumpsdata* and snapshot* are the required hosts for `dumpsgen`. > > @gmodena can you try: > ` > ssh snapshot1014.eqiad.wmnet > sudo -l > `... 
[15:24:25] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10250889 (10ssingh) Hi: As an update, this is pending approval so we are working on that internally and will merge this once that is done. Thanks! [15:24:29] (03CR) 10CI reject: [V:04-1] openstack: Remove OATHAuth 2FA (wmtotp) support [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [15:25:03] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [15:26:07] (03CR) 10Mmartorana: [C:03+2] Update miscweb: security-landing-page to latest image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082089 (https://phabricator.wikimedia.org/T377168) (owner: 10SBassett) [15:26:10] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10250918 (10xcollazo) Great, so you're good in terms of dumpsdata* and snapshot*. 
[15:26:32] (03CR) 10Mmartorana: [V:03+2 C:03+2] Update miscweb: security-landing-page to latest image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082089 (https://phabricator.wikimedia.org/T377168) (owner: 10SBassett) [15:27:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P70516 and previous config saved to /var/cache/conftool/dbconfig/20241022-152743-ladsgroup.json [15:30:11] (03PS25) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [15:30:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70517 and previous config saved to /var/cache/conftool/dbconfig/20241022-153031-arnaudb.json [15:30:35] (03CR) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:30:35] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:30:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:31:52] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:32:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [15:32:42] (03CR) 10Bking: [C:03+2] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:32:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10250979 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [15:34:12] (03CR) 10Ssingh: [C:03+1] pybal: Use `is` instead of `==` for comparing Python types [puppet] - 10https://gerrit.wikimedia.org/r/1076420 (owner: 10Majavah) [15:35:49] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:36:36] !log sbassett@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:36:44] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:36:54] (03CR) 10Clément Goubert: k8s.pool-depool-node: Add support for multiple nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [15:36:58] !log sbassett@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:37:16] !log sbassett@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:37:35] !log sbassett@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:37:37] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10251006 (10jcrespo) perccli and storecli are not exactly the same either, existing script fails wit... 
[15:38:24] !log sbassett@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:38:38] (03PS3) 10Vgutierrez: wmflib::service: Set depool_threshold as a float [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) [15:38:43] !log sbassett@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:38:48] !log sbassett@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:38:50] !log sbassett@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:38:58] !log sbassett@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:39:01] !log sbassett@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:39:46] (03PS16) 10BCornwall: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [15:40:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:41:52] (03CR) 10Majavah: [C:03+2] pybal: Use `is` instead of `==` for comparing Python types [puppet] - 10https://gerrit.wikimedia.org/r/1076420 (owner: 10Majavah) [15:42:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P70518 and previous config saved to /var/cache/conftool/dbconfig/20241022-154251-ladsgroup.json [15:44:50] (03CR) 10Ssingh: [C:03+1] varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [15:45:26] (03CR) 10Majavah: [C:03+2] P:wmcs::nfs: Format with black [puppet] - 10https://gerrit.wikimedia.org/r/1082227 (owner: 10Majavah) [15:45:37] (03CR) 10Majavah: [C:03+2] P:wmcs::nfs::standalone: Fix link [puppet] - 10https://gerrit.wikimedia.org/r/1082226 (owner: 10Majavah) [15:45:39] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70519 and previous config saved to /var/cache/conftool/dbconfig/20241022-154538-arnaudb.json [15:46:51] (03PS3) 10Majavah: openstack: Fix flake8 lint errors [puppet] - 10https://gerrit.wikimedia.org/r/1076419 [15:46:51] (03PS13) 10Majavah: taskgen: Only run Python3 tests [puppet] - 10https://gerrit.wikimedia.org/r/954267 [15:47:29] (03CR) 10CI reject: [V:04-1] openstack: Fix flake8 lint errors [puppet] - 10https://gerrit.wikimedia.org/r/1076419 (owner: 10Majavah) [15:49:32] (03CR) 10CI reject: [V:04-1] taskgen: Only run Python3 tests [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [15:50:48] (03CR) 10Vgutierrez: [C:03+1] P:trafficserver: extend x-wikimedia-debug-routing for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1072638 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [15:52:16] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:52:37] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:53:35] !log hnowlan@cumin1002 START - Cookbook sre.discovery.service-route check sessionstore: maintenance [15:53:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check sessionstore: maintenance [15:53:35] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [15:53:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org [15:53:50] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [15:53:51] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10251088 (10jcrespo) How far is this in the queue? 
The original need by was 2024-09-08, and this could help debug issues with: T377853 [15:55:09] (03PS1) 10Muehlenhoff: Remove ganeti2011 from active Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082240 (https://phabricator.wikimedia.org/T376594) [15:55:27] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10251100 (10Jhancock.wm) [15:55:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [15:56:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082240 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [15:57:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70520 and previous config saved to /var/cache/conftool/dbconfig/20241022-155759-ladsgroup.json [15:58:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [15:58:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [15:58:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T376905)', diff saved to https://phabricator.wikimedia.org/P70521 and previous config saved to /var/cache/conftool/dbconfig/20241022-155824-ladsgroup.json [15:59:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org [16:00:05] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:14] (03PS1) 10Ssingh: P:dns:auth: alert if a change was submitted but authdns-update was not run [puppet] - 10https://gerrit.wikimedia.org/r/1082241 [16:00:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70522 and previous config saved to /var/cache/conftool/dbconfig/20241022-160045-arnaudb.json [16:00:53] (03PS2) 10Gmodena: charts: airflow: alert only on task failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745) [16:01:25] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4335/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [16:01:29] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:22] (03CR) 10Krinkle: Profiler: introduce metrics batching and centralize socket management (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [16:06:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T376905)', diff saved to https://phabricator.wikimedia.org/P70523 and previous config saved to /var/cache/conftool/dbconfig/20241022-160625-ladsgroup.json [16:08:20] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10251137 (10MSantos) LGTM. Approved. [16:08:23] (03PS2) 10Ssingh: P:dns:auth: alert if a change was submitted but authdns-update was not run [puppet] - 10https://gerrit.wikimedia.org/r/1082241 [16:08:32] (03CR) 10Ssingh: "Updated the notes URL to point to https://wikitech.wikimedia.org/wiki/DNS#authdns_update_run." 
[puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [16:08:56] !log hnowlan@cumin1002 START - Cookbook sre.discovery.service-route depool sessionstore in eqiad: testing sessionstore mesh migration [16:09:47] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4336/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [16:14:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in eqiad: testing sessionstore mesh migration [16:14:09] (03CR) 10Ssingh: [C:03+1] "LGTM! We will need to do the Pybal restarts so let's plan for a tomorrow morning merge? (Can take care of those)" [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [16:15:39] (03PS3) 10Ssingh: P:dns:auth: alert if a change was submitted but authdns-update was not run [puppet] - 10https://gerrit.wikimedia.org/r/1082241 [16:15:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70524 and previous config saved to /var/cache/conftool/dbconfig/20241022-161552-arnaudb.json [16:15:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:15:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:15:58] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:16:04] (03CR) 10Ssingh: "No code change, fixed typo for DNS git repo" [puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [16:16:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70525 and previous config 
saved to /var/cache/conftool/dbconfig/20241022-161604-arnaudb.json [16:16:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4337/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [16:18:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70526 and previous config saved to /var/cache/conftool/dbconfig/20241022-161816-arnaudb.json [16:20:09] (03CR) 10Vgutierrez: "sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [16:20:21] (03PS1) 10SBassett: Prevent blocked users from being able to review/unreview articles [extensions/PageTriage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082243 (https://phabricator.wikimedia.org/T366991) [16:20:37] (03CR) 10Dzahn: [C:03+2] docker_registry_ha::registry: update gitlab-runner2002 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082208 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [16:20:38] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874 (10RobH) 03NEW [16:20:58] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10251221 (10RobH) [16:21:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P70527 and previous config saved to /var/cache/conftool/dbconfig/20241022-162132-ladsgroup.json [16:21:47] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10251225 (10RobH) a:03BTullis @btullis, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet 
updates from the sub-team receiving... [16:22:32] (03PS1) 10Volans: remote: add dry_run getter for RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082244 [16:22:32] (03PS1) 10Volans: mysql: refactor this currently unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 [16:23:11] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10251224 (10Ottomata) > if/when we'll decide to move to Eventstreams Are you sure you want to move to EventSt... [16:27:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [16:27:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081267 (https://phabricator.wikimedia.org/T375102) (owner: 10Pppery) [16:30:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [16:31:00] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1176.eqiad.wmnet [16:33:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70528 and previous config saved to /var/cache/conftool/dbconfig/20241022-163323-arnaudb.json [16:36:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1218', diff saved to https://phabricator.wikimedia.org/P70529 and previous config saved to /var/cache/conftool/dbconfig/20241022-163639-ladsgroup.json [16:38:09] (03CR) 10Hnowlan: [C:03+2] sessionstore: use service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:39:09] (03Merged) 10jenkins-bot: sessionstore: use service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:44:39] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [16:44:43] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [16:45:44] (03PS3) 10Majavah: P:toolforge::proxy: use svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1080056 [16:46:28] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [16:46:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10251293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye executed... [16:46:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [16:47:00] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10251291 (10Dzahn) Groups are assigned to server roles. clouddumps servers have the role of `dumps distribution servers`, which is described as "nodes (that) pull data periodically from the Analytics had... 
[16:47:04] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [16:47:48] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70530 and previous config saved to /var/cache/conftool/dbconfig/20241022-164830-arnaudb.json [16:50:21] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10251296 (10Dzahn) If you want to request membership in dumps-roots that would give access to all of this: ` role/common/dumps/generation/server/xmldumps.yaml: - dumps-roots role/common/dumps/generatio... [16:51:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T376905)', diff saved to https://phabricator.wikimedia.org/P70531 and previous config saved to /var/cache/conftool/dbconfig/20241022-165147-ladsgroup.json [16:51:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:52:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:52:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1176.eqiad.wmnet [16:52:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T376905)', diff saved to https://phabricator.wikimedia.org/P70532 and previous config saved to /var/cache/conftool/dbconfig/20241022-165211-ladsgroup.json [16:53:40] (03CR) 10Ottomata: [C:03+2] [analytics][refine] Use deduplication fix backport in legacy Refine job [puppet] - 10https://gerrit.wikimedia.org/r/1082218 
(https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [16:54:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [16:55:46] (03PS1) 10Btullis: Add new kafka-jumbo nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) [16:56:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4338/console" [puppet] - 10https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) (owner: 10Btullis) [16:56:52] (03PS2) 10Btullis: Add new kafka-jumbo nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) [16:57:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4339/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) (owner: 10Btullis) [16:57:48] 10ops-eqsin, 06SRE: Inbound interface errors - asw1-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T376837#10251368 (10Dzahn) [16:58:35] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10251370 (10xcollazo) Thanks for the context @Dzahn. `dumps-roots` is indeed what we want for @gmodena. @Ottomata can you please approve? 
[16:58:36] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - msw1-eqiad.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T376547#10251371 (10Dzahn) [16:59:24] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10251375 (10Ottomata) Approved [16:59:25] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878 (10RobH) 03NEW [16:59:55] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10251403 (10RobH) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1700) [17:00:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T376905)', diff saved to https://phabricator.wikimedia.org/P70533 and previous config saved to /var/cache/conftool/dbconfig/20241022-170008-ladsgroup.json [17:00:14] (03PS7) 10Ssingh: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [17:00:15] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Grant bd808 membership in the contint-roots and contint-docker groups - https://phabricator.wikimedia.org/T377792#10251358 (10Dzahn) This all looks ready to go except I think it needs approval from @Bmueller as the m... [17:00:20] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10251404 (10RobH) a:03BTullis Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new serve... 
[17:01:36] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [17:03:22] 06SRE, 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10251410 (10Dzahn) With the group approval done now, the other approval it needs is from the direct manager, which Betterworks lists as @Ahoelzl . [17:03:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70534 and previous config saved to /var/cache/conftool/dbconfig/20241022-170337-arnaudb.json [17:03:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [17:03:43] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:03:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [17:04:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70535 and previous config saved to /var/cache/conftool/dbconfig/20241022-170400-arnaudb.json [17:04:14] !log disable Puppet on A:lvs to merge 1006063: T358260 [17:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:18] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260 [17:04:49] !log hnowlan@cumin1002 START - Cookbook sre.discovery.service-route pool sessionstore in eqiad: repooling sessionstore post mesh migration T363996 [17:04:54] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - 
https://phabricator.wikimedia.org/T363996 [17:05:08] (03CR) 10Ssingh: [V:03+1 C:03+2] LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [17:09:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in eqiad: repooling sessionstore post mesh migration T363996 [17:09:58] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [17:14:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10251475 (10Dzahn) Fixing this should also resolve older ticket T351817 (alongside T365259). [17:14:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10251480 (10Dzahn) [17:14:55] !log re-enable Puppet on A:lvs [change merged on lvs2014]: T358260 [17:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:59] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260 [17:15:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P70536 and previous config saved to /var/cache/conftool/dbconfig/20241022-171515-ladsgroup.json [17:16:46] (03PS1) 10Hnowlan: sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) [17:16:53] (03CR) 10CI reject: [V:04-1] sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:17:48] 
!log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs2014.codfw.wmnet with reason: rebooting to test changes rolled out in CR 1006063 [17:18:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2014.codfw.wmnet with reason: rebooting to test changes rolled out in CR 1006063 [17:18:08] (03PS2) 10Hnowlan: sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) [17:19:04] (03CR) 10CI reject: [V:04-1] sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:23:48] !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [17:26:21] 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#10251532 (10ops-monitoring-bot) Host rebooted by cmooney@cumin1002 with reason: Reboot host to apply new sysctls [17:26:21] (03PS3) 10Hnowlan: sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) [17:27:17] (03CR) 10Tchanders: [C:03+1] "Note that we'll also need a backport for Ic0bdb834cf082182d66260e62eeb534a8cf696b4" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [17:28:51] (03PS6) 10JMeybohm: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) [17:30:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [17:30:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P70537 and previous config saved to /var/cache/conftool/dbconfig/20241022-173022-ladsgroup.json [17:30:54] (03CR) 10JMeybohm: k8s.pool-depool-node: Add support for multiple nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [17:36:14] jouncebot now [17:36:14] For the next 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1700) [17:40:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082243 (https://phabricator.wikimedia.org/T366991) (owner: 10SBassett) [17:41:58] (03PS4) 10Hnowlan: sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) [17:45:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T376905)', diff saved to https://phabricator.wikimedia.org/P70538 and previous config saved to /var/cache/conftool/dbconfig/20241022-174530-ladsgroup.json [17:45:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [17:45:45] !log sudo cumin "A:cp-upload" 'disable-puppet "merging CR 1078994"': T375761 [17:45:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [17:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:53] T375761: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761 [17:45:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T376905)', diff saved to 
https://phabricator.wikimedia.org/P70539 and previous config saved to /var/cache/conftool/dbconfig/20241022-174555-ladsgroup.json [17:46:09] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:47:30] (03CR) 10Hnowlan: [C:03+2] sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:48:41] (03Merged) 10jenkins-bot: sessionstore: temporarily ensure that codfw does not use the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082253 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:48:47] (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: add pediapress.com to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: 10Ssingh) [17:49:15] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@16eb792] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/90 [17:50:01] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@16eb792] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/90 (duration: 01m 21s) [17:54:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T376905)', diff saved to https://phabricator.wikimedia.org/P70540 and previous config saved to /var/cache/conftool/dbconfig/20241022-175409-ladsgroup.json [17:54:14] !log sudo cumin -b4 "A:cp-upload" 'run-puppet-agent --enable "merging CR 1078994"': T375761 [17:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:18] T375761: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761 [17:59:03] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Mailing 
list Delivery Mode set to None - https://phabricator.wikimedia.org/T368134#10251674 (10Dzahn) I am not aware of further reports like this. And I think if it had affected all lists we would have heard about it, while if it was limited to a s... [18:00:05] dancy and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1800). [18:00:40] Waiting for a backport to complete, then I'll roll the train to group0 [18:01:01] (03Merged) 10jenkins-bot: Prevent blocked users from being able to review/unreview articles [extensions/PageTriage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082243 (https://phabricator.wikimedia.org/T366991) (owner: 10SBassett) [18:01:32] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1082243|Prevent blocked users from being able to review/unreview articles (T366991)]] [18:01:37] T366991: CVE-2024-47848: User can review/unreview articles while blocked - https://phabricator.wikimedia.org/T366991 [18:02:32] (03PS2) 10MacFan4000: ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) [18:03:11] (03PS1) 10Ssingh: hiera: set profile::lvs::do_ipv6_ra_primary on lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/1082257 (https://phabricator.wikimedia.org/T358260) [18:04:04] !log dancy@deploy2002 dancy, sbassett: Backport for [[gerrit:1082243|Prevent blocked users from being able to review/unreview articles (T366991)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:04:06] (03PS2) 10Cwhite: Profiler: introduce metrics batching and centralize socket management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 [18:04:14] !log dancy@deploy2002 dancy, sbassett: Continuing with sync [18:04:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70541 and previous config saved to /var/cache/conftool/dbconfig/20241022-180426-arnaudb.json [18:04:32] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:04:33] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1082257 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [18:04:42] (03CR) 10Cwhite: Profiler: introduce metrics batching and centralize socket management (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [18:08:59] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082243|Prevent blocked users from being able to review/unreview articles (T366991)]] (duration: 07m 26s) [18:09:07] T366991: CVE-2024-47848: User can review/unreview articles while blocked - https://phabricator.wikimedia.org/T366991 [18:09:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P70542 and previous config saved to /var/cache/conftool/dbconfig/20241022-180916-ladsgroup.json [18:09:59] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082259 (https://phabricator.wikimedia.org/T375659) [18:10:01] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082259 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:10:15] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10251715 (10ssingh) 05Open→03Resolved a:03ssingh Change has been rolled out. Please re-open this task if there are any issues. Thanks! 
[18:10:31] (03CR) 10Cathal Mooney: [C:03+1] hiera: set profile::lvs::do_ipv6_ra_primary on lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/1082257 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [18:10:39] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082259 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:12:55] (03PS1) 10Dzahn: url_downloader: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082260 [18:14:28] (03PS1) 10Dzahn: installserver: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082261 [18:15:39] (03PS1) 10Dzahn: idp: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082262 [18:17:31] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.28 refs T375659 [18:17:49] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659 [18:19:33] (03CR) 10BCornwall: [C:03+2] varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:19:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70543 and previous config saved to /var/cache/conftool/dbconfig/20241022-181933-arnaudb.json [18:24:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P70544 and previous config saved to /var/cache/conftool/dbconfig/20241022-182423-ladsgroup.json [18:24:32] !log dancy@deploy2002 Started scap sync-world: Refreshing [18:26:06] !log dancy@deploy2002 sync-world aborted: Refreshing (duration: 01m 33s) [18:26:40] (03CR) 10Muehlenhoff: [C:04-1] "We don't have a NETWORK_INFRA set definition for nftables yet" [puppet] - 10https://gerrit.wikimedia.org/r/1082261 (owner: 
10Dzahn) [18:27:58] (03PS4) 10Ssingh: P:dns:auth: alert if a change was submitted but authdns-update was not run [puppet] - 10https://gerrit.wikimedia.org/r/1082241 [18:31:40] 06SRE, 06SRE-OnFire, 06collaboration-services, 10Release-Engineering-Team (Radar), 07Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162#10251781 (10Dzahn) >>! In T309162#8877794, @hashar wrote: > This was long forgotten. The problem is when a `Scap::T... [18:34:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70546 and previous config saved to /var/cache/conftool/dbconfig/20241022-183440-arnaudb.json [18:37:40] (03Abandoned) 10Dzahn: installserver: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082261 (owner: 10Dzahn) [18:37:48] (03Abandoned) 10Dzahn: idp: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082262 (owner: 10Dzahn) [18:37:53] (03Abandoned) 10Dzahn: url_downloader: ferm::service -> firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1082260 (owner: 10Dzahn) [18:39:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T376905)', diff saved to https://phabricator.wikimedia.org/P70547 and previous config saved to /var/cache/conftool/dbconfig/20241022-183930-ladsgroup.json [18:39:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [18:39:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [18:39:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T376905)', diff saved to https://phabricator.wikimedia.org/P70548 and previous config saved to /var/cache/conftool/dbconfig/20241022-183955-ladsgroup.json [18:48:07] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1234 (T376905)', diff saved to https://phabricator.wikimedia.org/P70549 and previous config saved to /var/cache/conftool/dbconfig/20241022-184806-ladsgroup.json [18:48:33] (03PS1) 10Dzahn: gerrit: use 'gerrit' instead of 'gerrit2' as system user on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [18:49:44] (03PS2) 10Dzahn: gerrit: use 'gerrit' instead of 'gerrit2' as system user on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [18:49:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70550 and previous config saved to /var/cache/conftool/dbconfig/20241022-184946-arnaudb.json [18:49:52] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:53:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10251820 (10Papaul) @MatthewVernon sorry didn't get back with you on this disk issues the new ms-be nodes are having. If the operating system is not seein... [18:55:07] dancy: ok if i deploy the new scap release? 
[18:55:42] Go for it [18:56:01] !log dduvall@deploy2002 Installing scap version "4.116.0" for 209 hosts [19:00:12] !log dduvall@deploy2002 Installation of scap version "4.116.0" completed for 209 hosts [19:03:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P70551 and previous config saved to /var/cache/conftool/dbconfig/20241022-190313-ladsgroup.json [19:11:45] (03PS3) 10Dzahn: gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [19:12:19] (03CR) 10CI reject: [V:04-1] gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:12:40] (03CR) 10Dzahn: "going to use this now on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082264/3/modules/gerrit/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar) [19:12:42] (03PS10) 10Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) [19:18:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P70552 and previous config saved to /var/cache/conftool/dbconfig/20241022-191820-ladsgroup.json [19:21:18] jouncebot: nowandnext [19:21:19] For the next 0 hour(s) and 38 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T1800) [19:21:19] In 0 hour(s) and 38 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T2000) [19:22:07] dancy: dduvall: If you are done and you're okay with it, can I deploy this? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1081310 [19:22:48] Have at it. [19:28:18] awesome [19:30:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081310 (owner: 10Ebrahim) [19:30:46] (03Merged) 10jenkins-bot: Fix duplicated key in wgVectorNightMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081310 (owner: 10Ebrahim) [19:31:13] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1081310|Fix duplicated key in wgVectorNightMode]] [19:31:26] (03PS1) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [19:31:56] (03PS2) 10Dzahn: site/mx: move interface::alias out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [19:32:00] (03CR) 10CI reject: [V:04-1] site/mx: move interface::alias out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:32:30] (03CR) 10CI reject: [V:04-1] site/mx: move interface::alias out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:33:04] (03PS3) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [19:33:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T376905)', diff saved to https://phabricator.wikimedia.org/P70553 and previous config saved to /var/cache/conftool/dbconfig/20241022-193327-ladsgroup.json [19:33:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [19:33:38] (03CR) 10CI reject: [V:04-1] site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:33:44] !log ladsgroup@deploy2002 ladsgroup, ebrahim: Backport for [[gerrit:1081310|Fix duplicated key in wgVectorNightMode]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [19:33:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [19:33:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T376905)', diff saved to https://phabricator.wikimedia.org/P70554 and previous config saved to /var/cache/conftool/dbconfig/20241022-193352-ladsgroup.json [19:33:59] (03PS2) 10Gergő Tisza: Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) [19:33:59] (03CR) 10Gergő Tisza: "> The following files contain Git conflicts:" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [19:34:07] (03PS4) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [19:34:27] !log ladsgroup@deploy2002 ladsgroup, ebrahim: Continuing with sync [19:35:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [19:35:53] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1082264/4342/gerrit1003.wikimedia.org/change.gerrit1003.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:36:11] ladsgroup@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[19:39:05] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081310|Fix duplicated key in wgVectorNightMode]] (duration: 07m 51s) [19:40:31] !log disabling puppet on A:cp-text before merging ATS Lua changes - T372605 [19:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:37] (03PS4) 10Dzahn: gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [19:40:55] (03CR) 10Scott French: [C:03+2] P:trafficserver: extend x-wikimedia-debug-routing for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1072638 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [19:41:35] T372605: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605 [19:41:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T376905)', diff saved to https://phabricator.wikimedia.org/P70555 and previous config saved to /var/cache/conftool/dbconfig/20241022-194156-ladsgroup.json [19:44:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.184s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:49:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.184s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:50:17] 06SRE, 
06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160#10251989 (10Dzahn) This was an exim specific wording in the... [19:52:22] (03PS5) 10Dzahn: gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [19:54:31] !log running puppet on A:cp-text (-b11) after validating ATS Lua changes on cp4040 - T372605 [19:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:12] T372605: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605 [19:56:14] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1082264/4344/" [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:57:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P70556 and previous config saved to /var/cache/conftool/dbconfig/20241022-195703-ladsgroup.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T2000) [20:00:05] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:04:30] o/ [20:05:54] (03CR) 10Gergő Tisza: "recheck" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:12:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P70557 and previous config saved to /var/cache/conftool/dbconfig/20241022-201210-ladsgroup.json [20:12:44] (03CR) 10CI reject: [V:04-1] Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:15:53] (03CR) 10Gergő Tisza: "recheck" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:16:05] I can help w/ backport deployments if needed [20:16:23] (03PS1) 10Gergő Tisza: Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) [20:16:48] thx, I can self-deploy if I manage to unbork CI in time [20:17:01] Good luck! 
[20:27:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T376905)', diff saved to https://phabricator.wikimedia.org/P70558 and previous config saved to /var/cache/conftool/dbconfig/20241022-202717-ladsgroup.json [20:27:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [20:27:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [20:32:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [20:32:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [20:47:09] (03CR) 10CI reject: [V:04-1] Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:51:00] (03CR) 10Gergő Tisza: "recheck" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:51:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [20:54:42] I'll reschedule to tomorrow. 
[21:00:01] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@b08d130] (releasing): Deploying changes to single-version MediaWiki image build [21:00:42] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@b08d130] (releasing): Deploying changes to single-version MediaWiki image build (duration: 01m 44s) [21:10:56] (03PS1) 10Daimona Eaytoy: WikiProjectIDLookup: use SparqlClient and make endpoint configurable [extensions/WikimediaCampaignEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082277 (https://phabricator.wikimedia.org/T377746) [21:11:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaCampaignEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082277 (https://phabricator.wikimedia.org/T377746) (owner: 10Daimona Eaytoy) [21:17:50] (03PS1) 10Zabe: s1: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082278 (https://phabricator.wikimedia.org/T183490) [21:21:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160#10252214 (10jhathaway) @Dzahn, today I removed the aliases... 
[21:40:11] !log dancy@deploy2002 Installing scap version "4.117.0" for 209 hosts [21:44:21] !log dancy@deploy2002 Installation of scap version "4.117.0" completed for 209 hosts [21:44:57] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncmonitor1001.eqiad.wmnet [21:48:09] (03PS2) 10Dzahn: mariadb: update grants for phab2002 with new IP [puppet] - 10https://gerrit.wikimedia.org/r/1080781 (https://phabricator.wikimedia.org/T377374) [21:48:15] (03CR) 10Ladsgroup: [C:03+2] mariadb: update grants for phab2002 with new IP [puppet] - 10https://gerrit.wikimedia.org/r/1080781 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [21:48:18] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: update grants for phab2002 with new IP [puppet] - 10https://gerrit.wikimedia.org/r/1080781 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [21:48:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncmonitor1001.eqiad.wmnet [21:51:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T367856)', diff saved to https://phabricator.wikimedia.org/P70559 and previous config saved to /var/cache/conftool/dbconfig/20241022-215137-ladsgroup.json [21:52:12] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:02:00] jouncebot: nowandnext [22:02:00] No deployments scheduled for the next 7 hour(s) and 57 minute(s) [22:02:00] In 7 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T0600) [22:03:57] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1082278|s1: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [22:04:47] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:06:26] !log zabe@deploy2002 zabe: Backport for [[gerrit:1082278|s1: Reduce revision-slots cache 
expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:06:28] !log zabe@deploy2002 zabe: Continuing with sync [22:07:05] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383#10252269 (10TAndic) 05Resolved→03Open Hello SRE! I have been issued a new computer as part of a transition to req holder and need to update my SSH key for access. SS... [22:07:09] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10252272 (10Eevans) This is fascinating; So a `VACUUM` shrunk the database size by .8G... //but 3.9G of blocks//? Is that because the... [22:08:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P70560 and previous config saved to /var/cache/conftool/dbconfig/20241022-220847-ladsgroup.json [22:11:14] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082278|s1: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 07m 17s) [22:12:11] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:20:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10252289 (10Jclark-ctr) @cmooney Step 1: Firewall Installation & Cabling is complete Since we have racked the new switches a... 
[22:21:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10252290 (10Jclark-ctr) [22:23:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P70561 and previous config saved to /var/cache/conftool/dbconfig/20241022-222352-ladsgroup.json [22:38:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P70562 and previous config saved to /var/cache/conftool/dbconfig/20241022-223858-ladsgroup.json [22:57:14] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896 (10RobH) 03NEW [22:58:17] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10252347 (10RobH) a:03Eevans @eevans, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the... 
[22:58:38] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10252382 (10RobH) [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082300 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082300 (owner: 10TrainBranchBot) [23:51:38] (03PS1) 10Eevans: restbase203[6-8]: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1082301 (https://phabricator.wikimedia.org/T377896) [23:51:39] (03PS1) 10Eevans: Configure restbase10[34-42] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T354227) [23:59:51] (03PS2) 10Eevans: restbase203[6-8]: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1082301 (https://phabricator.wikimedia.org/T377896) [23:59:51] (03PS2) 10Eevans: Configure restbase10[34-42] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T354227)