[00:05:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:02] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:32:03] (03PS2) 10Andrea Denisse: titan: Bring thanos raw retention to 44w [puppet] - 10https://gerrit.wikimedia.org/r/1088390 (https://phabricator.wikimedia.org/T351927) [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088391 [00:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088391 (owner: 10TrainBranchBot) [00:45:34] (03CR) 10Cwhite: [C:03+1] titan: Bring thanos raw retention to 44w [puppet] - 10https://gerrit.wikimedia.org/r/1088390 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [00:50:01] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10302537 (10colewhite) ` root@logging-hd1005:~$ ipmitool lan print 1 Set in Progress : Set Complete Auth Type Support : NONE MD2 MD5 PASSWORD Auth Type Enable : Callba... 
[00:55:39] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:58:42] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T379337 (10phaultfinder) 03NEW [01:02:41] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:08:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1088393 [01:08:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1088393 (owner: 10TrainBranchBot) [01:10:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088391 (owner: 10TrainBranchBot) [01:27:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10302579 (10Jhancock.wm) [01:28:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10302581 (10Jhancock.wm) [01:29:56] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10302583 (10Jhancock.wm) [01:31:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10302584 (10Jhancock.wm) need to check 2149 in the morning. nic port not up. 
[01:39:41] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1088393 (owner: 10TrainBranchBot) [01:47:02] (03CR) 10Zabe: Reopen testcommonswiki for testing Chart extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [01:55:41] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:04:41] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:22:37] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:42] (03CR) 10Seddon: Reopen testcommonswiki for testing Chart extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:07:02] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:41] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10302721 (10Risker) >>! In T377045#10302490, @FastLizard4 wrote: > Looks like this has just happened on Wikimedia-l. Here's a link to the archiv... [06:26:23] (03PS1) 10Stevemunene: airflow: add wmde namespace [puppet] - 10https://gerrit.wikimedia.org/r/1088404 (https://phabricator.wikimedia.org/T378438) [06:46:33] (03PS1) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [06:47:30] (03CR) 10CI reject: [V:04-1] airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [06:52:15] (03CR) 10Brouberol: [C:03+1] airflow: add wmde namespace [puppet] - 10https://gerrit.wikimedia.org/r/1088404 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [06:57:31] (03PS1) 10Stevemunene: idp:Add aiflow-wmde to idp [puppet] - 10https://gerrit.wikimedia.org/r/1088407 (https://phabricator.wikimedia.org/T378438) [06:59:04] (03PS2) 10Stevemunene: idp:Add airflow-wmde to idp [puppet] - 10https://gerrit.wikimedia.org/r/1088407 (https://phabricator.wikimedia.org/T378438) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241108T0700) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for 
poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:32:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088407 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [07:33:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [07:33:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [07:33:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [07:39:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet [07:39:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet [07:45:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet [07:46:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [07:48:05] !log manually install/test gnmic 0.39 on netflow6001 [07:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [07:48:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10302825 (10MoritzMuehlenhoff) [07:59:28] (03PS1) 10KCVelaga: Update stream registration and config for MinT for Readers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241108T0800) [08:05:53] !log add gnmic 0.39 from official git repo to bookworm reprepro - T347461 [08:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:01] T347461: Build and package gnmic - https://phabricator.wikimedia.org/T347461 [08:07:02] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:59] !log update gnmic to 0.39 on all netflow hosts [08:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:49] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp enable Redis TGT for all of production. [puppet] - 10https://gerrit.wikimedia.org/r/1087913 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [08:20:11] (03CR) 10Slyngshede: [C:03+2] Permission UI: Minor tweaks to permission approval UI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1088286 (owner: 10Slyngshede) [08:22:03] (03PS1) 10Muehlenhoff: Remove mw_rc_irc role from irc1002/2002 for decom of the legacy service [puppet] - 10https://gerrit.wikimedia.org/r/1088482 (https://phabricator.wikimedia.org/T376014) [08:22:36] (03Merged) 10jenkins-bot: Permission UI: Minor tweaks to permission approval UI. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1088286 (owner: 10Slyngshede) [08:26:40] !log upgraded ircstream on irc.wikimedia.org to 1.0.1 [08:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:02] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10302843 (10MoritzMuehlenhoff) One more update: The upstream author (Faidon) of ircstream fixed the underlying bug in https://githu... [08:27:24] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:27:43] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:27:53] (03CR) 10Brouberol: [C:03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/1088407 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [08:29:02] (03CR) 10Brouberol: [C:03+2] airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [08:29:18] (03CR) 10Brouberol: [C:03+2] airflow: render the spark/hadoop/hdfs/yarn configuration files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [08:29:23] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [08:30:16] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqsin [08:30:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - 
https://phabricator.wikimedia.org/T371400#10302865 (10elukey) @jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :) > The only notable piece was that when swit... [08:30:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin [08:31:07] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:33:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:34:40] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad [08:34:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:34:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad [08:35:20] (03PS2) 10Brouberol: airflow: create the kerberos token PVC even if kerberos is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087459 (https://phabricator.wikimedia.org/T375875) [08:35:38] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device ssw1-e1-eqiad [08:35:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-eqiad [08:35:47] (03CR) 10Stevemunene: [C:03+2] idp:Add airflow-wmde to idp [puppet] - 10https://gerrit.wikimedia.org/r/1088407 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [08:39:00] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 175230 MB (4% inode=92%): /srv/swift-storage/sdc1 181925 MB (4% inode=91%): /srv/swift-storage/sdf1 228034 MB (5% inode=91%): /srv/swift-storage/sdg1 198650 MB (5% inode=91%): /srv/swift-storage/sdd1 193135 MB (5% inode=91%): /srv/swift-storage/sde1 189408 MB (4% inode=92%): /srv/swift-storage/sdi1 181830 MB (4% inode=91%): /srv/swift-st 
[08:39:00] k1 172738 MB (4% inode=92%): /srv/swift-storage/sdj1 186190 MB (4% inode=91%): /srv/swift-storage/sdl1 176438 MB (4% inode=91%): /srv/swift-storage/sdm1 186891 MB (4% inode=91%): /srv/swift-storage/sdn1 152395 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [08:39:29] (03CR) 10Elukey: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1088482 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [08:39:39] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device ssw1-f1-eqiad [08:39:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-f1-eqiad [08:40:52] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a2-codfw [08:40:59] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343 (10MoritzMuehlenhoff) 03NEW [08:41:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a2-codfw [08:41:19] (03CR) 10Muehlenhoff: [C:03+2] Remove mw_rc_irc role from irc1002/2002 for decom of the legacy service [puppet] - 10https://gerrit.wikimedia.org/r/1088482 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [08:41:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:41:34] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a3-codfw [08:41:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a3-codfw [08:41:55] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a4-codfw [08:42:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device 
lsw1-a4-codfw [08:42:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1048.eqiad.wmnet to cluster eqiad and group C [08:42:42] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a5-codfw [08:42:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a5-codfw [08:43:11] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a6-codfw [08:43:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a6-codfw [08:43:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1048.eqiad.wmnet to cluster eqiad and group C [08:43:34] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a7-codfw [08:43:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a7-codfw [08:43:57] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-a8-codfw [08:44:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a8-codfw [08:47:32] (03PS1) 10Ilias Sarantopoulos: ml-services: add deprecation messsage to ores-legacy ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088505 [08:51:12] PROBLEM - ircecho bot process on irc1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [08:51:32] FIRING: UdpMxIrcEchoThroughput: irc2002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [08:52:02] FIRING: [4x] ProbeDown: Service irc1002:6667 has failed probes (tcp_mw_rc_irc_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:52:10] PROBLEM - ircecho bot process on irc2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [08:52:17] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b2-codfw [08:52:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b2-codfw [08:52:40] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b3-codfw [08:52:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b3-codfw [08:52:53] ^ircecho is expected, it's being decommed and should recover soon [08:52:56] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b4-codfw [08:53:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b4-codfw [08:53:25] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b5-codfw [08:53:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b5-codfw [08:54:08] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b6-codfw [08:54:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b6-codfw [08:55:42] FIRING: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:56:32] RESOLVED: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - 
https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [08:56:49] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2085.codfw.wmnet with OS bullseye [08:57:15] (03PS6) 10Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [08:58:43] (03CR) 10Kevin Bazira: [C:03+1] ml-services: add deprecation messsage to ores-legacy ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088505 (owner: 10Ilias Sarantopoulos) [09:01:01] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b7-codfw [09:01:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b7-codfw [09:01:15] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-b8-codfw [09:01:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b8-codfw [09:03:09] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device ssw1-a1-codfw [09:03:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a1-codfw [09:03:41] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device ssw1-a8-codfw [09:03:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a8-codfw [09:09:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2085.codfw.wmnet with OS bullseye [09:09:37] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2085.codfw.wmnet with OS bullseye [09:17:02] FIRING: [3x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
- https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:07] RESOLVED: [4x] ProbeDown: Service irc1002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:35] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [09:20:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:22:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10302965 (10elukey) I tried with ms-be2085, doing the following: * Provision to UEFI, manual/extra chassis reset triggered via spicerack-shell. * Verify... 
[09:24:19] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [09:24:34] (03CR) 10Ayounsi: "one small comment but overall lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [09:24:37] (03CR) 10Ayounsi: [C:03+1] Add puppet entries for new fundraising switches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [09:29:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on an-presto1018.eqiad.wmnet with reason: Downtimed for further troubleshooting possible Hardware failure [09:29:38] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on an-presto1018.eqiad.wmnet with reason: Downtimed for further troubleshooting possible Hardware failure [09:29:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10302971 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c79896bf-7b1a-4996-a194-5ddd94c51f42) set by stevemunene@cumin1002 fo... 
[09:33:44] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: apply more overrides after d-i for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088324 (owner: 10Elukey) [09:38:55] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:39:50] !log elukey@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [09:41:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [09:41:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2085.codfw.wmnet with OS bullseye [09:43:41] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [09:48:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10303005 (10Gehel) [09:48:43] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [09:49:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:49:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10303003 (10Gehel) [09:49:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10303007 (10Gehel) [09:50:43] (03CR) 10Santiago Faci: [C:03+1] Update stream registration and config for MinT for Readers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) 
(owner: 10KCVelaga) [09:54:53] (03PS2) 10Gmodena: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) [09:57:52] !log testing account creation backfill script on mwmaint2001 in screen session as ariel [09:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:57] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:02:04] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:02:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10303145 (10elukey) Very interesting - I watched the sol1 console of ms-be2086 when doing provisioning, and right after the second round of reboot (for BI... [10:03:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10303169 (10MatthewVernon) [10:04:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10303176 (10elukey) This is the boot order right after provisioning: ` 'BootModeSelect': 'UEFI',... 
[10:04:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10303077 (10Gehel) [10:04:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10303183 (10MatthewVernon) [10:10:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10303187 (10MatthewVernon) Hi @Khantstop, I think you're a contractor - can you or @OSefu-WMF confirm the contra... [10:11:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4478/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [10:12:02] FIRING: [3x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:12:22] (03PS1) 10Muehlenhoff: Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1088522 [10:16:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [10:16:58] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [10:17:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10303230 (10ops-monitoring-bot) Draining ganeti1011.eqiad.wmnet of running VMs [10:18:56] !log 
elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2086.codfw.wmnet with OS bullseye [10:19:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet [10:23:18] (03CR) 10Nik Gkountas: [C:03+1] Update stream registration and config for MinT for Readers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: 10KCVelaga) [10:26:59] (03PS2) 10Cathal Mooney: Add puppet entries for new fundraising switches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) [10:27:14] (03CR) 10Cathal Mooney: Add puppet entries for new fundraising switches in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [10:28:14] (03PS1) 10Elukey: [TEST] sre.hosts.reimage: enable/disable PXE over HTTP for UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1088524 [10:29:41] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [10:31:40] (03PS1) 10Muehlenhoff: Deprecate system::role for dumps roles [puppet] - 10https://gerrit.wikimedia.org/r/1088526 [10:33:40] 06SRE, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349 (10MoritzMuehlenhoff) 03NEW [10:34:37] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2086.codfw.wmnet with OS bullseye [10:38:09] (03PS2) 10Elukey: [TEST] sre.hosts.reimage: enable/disable PXE over HTTP for UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1088524 [10:38:38] (03CR) 10Mvolz: [C:04-1] zotero: Switch image from gerrit- to GitLab-hosted (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088330 (https://phabricator.wikimedia.org/T374558) (owner: 10Jforrester) [10:39:04] !log brouberol@deploy2002 
helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:42:27] 06SRE, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10303318 (10MoritzMuehlenhoff) [10:42:46] (03PS3) 10Gmodena: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) [10:44:11] (03CR) 10Ayounsi: [C:03+1] Add puppet entries for new fundraising switches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [10:44:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] interface::rule: Add missing -6 flag [puppet] - 10https://gerrit.wikimedia.org/r/1088357 (owner: 10Majavah) [10:45:28] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [10:45:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2015.codfw.wmnet [10:48:20] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [10:51:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:55:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10303351 (10MoritzMuehlenhoff) [10:55:26] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2015.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:56:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2015.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" 
[10:56:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2015.codfw.wmnet [10:56:08] 06SRE, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10303358 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2015.codfw.wmnet` - ganeti2015.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... [10:56:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2016.codfw.wmnet [10:57:29] (03PS1) 10Cathal Mooney: Move idle-timeout under login to the dedicated login template [homer/public] - 10https://gerrit.wikimedia.org/r/1088535 (https://phabricator.wikimedia.org/T377381) [10:57:46] (03CR) 10Cathal Mooney: [C:03+2] Add puppet entries for new fundraising switches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [10:58:20] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2086.codfw.wmnet with OS bullseye [11:00:23] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [11:02:50] (03Abandoned) 10Clément Goubert: external_clouds_vendors: Use requestctl apply [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [11:04:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:04:28] (03PS1) 10Cathal Mooney: Remove pfw3-eqiad and replace with pfw1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088537 (https://phabricator.wikimedia.org/T377381) [11:05:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:07:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:12:18] (03PS1) 10Arturo Borrero Gonzalez: 
prometheus-node-kernel-panic: scan last 60d worth of messages [puppet] - 10https://gerrit.wikimedia.org/r/1088539 [11:13:01] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2086.codfw.wmnet with OS bullseye [11:13:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:13:52] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:16:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:17:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:23:13] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10303402 (10Marostegui) Thanks Willy! [11:24:32] 06SRE, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10303403 (10MoritzMuehlenhoff) [11:24:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2016.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:25:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2016.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:25:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2016.codfw.wmnet [11:25:42] 06SRE, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10303404 
(10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti2016.codfw.wmnet` - ganeti2016.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertman... [11:26:18] (03PS1) 10Brouberol: airflow: remove fsGroup stanzas as all containers are running with the same uid/gid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088543 (https://phabricator.wikimedia.org/T379265) [11:27:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2087.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:29:36] (03PS35) 10Marostegui: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [11:36:25] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2015/ganeti2016 - https://phabricator.wikimedia.org/T379349#10303417 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [11:37:28] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [11:37:35] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2087.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:38:17] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10303421 (10elukey) Another test, leading to weird results. I tried to do the following: * Manually disable `IPV4HTTPSupport` via spicerack shell, to bas... [11:40:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10303422 (10elukey) So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pw... 
[11:48:07] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [11:51:26] (03CR) 10Mvolz: [C:04-1] "https://gitlab.wikimedia.org/repos/mediawiki/services/zotero/-/merge_requests/3 to sync, then needs re-publish" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088330 (https://phabricator.wikimedia.org/T374558) (owner: 10Jforrester) [11:51:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [11:53:22] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2087.codfw.wmnet with OS bullseye [11:59:36] !log testing of account creation backfill script on mwmaint2001 complete for the moment [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241108T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241108T1200). 
[12:00:48] (03PS6) 10Elukey: profile::trafficserver::backend: move kartotherian to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1087423 (https://phabricator.wikimedia.org/T378944) [12:01:49] (03CR) 10Elukey: Create new lvs service kartotherian-k8s-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [12:02:25] (03PS8) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [12:03:58] (03PS1) 10Btullis: [wikireplicas] Redact the abuse_filter_action table with a custom view [puppet] - 10https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) [12:04:44] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2087.codfw.wmnet with OS bullseye [12:05:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10303460 (10elukey) I also tried to not configure any special JBOD config for ms-be2087 after provision, and kick off reimage to see if the double d-i iss... 
[12:07:30] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [12:09:44] (03PS1) 10ArielGlenn: systemd job to create missing local accounts on loginwiki/metawiki [puppet] - 10https://gerrit.wikimedia.org/r/1088552 (https://phabricator.wikimedia.org/T378401) [12:09:53] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump version to match package version of latest sec release for Java 8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1088267 (owner: 10Muehlenhoff) [12:12:17] (03CR) 10FNegri: prometheus-node-kernel-panic: scan last 60d worth of messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088539 (owner: 10Arturo Borrero Gonzalez) [12:15:06] (03CR) 10Dreamy Jazz: [C:04-1] "Logged-out users can see the actions of public filters. For example, load https://en.wikipedia.org/wiki/Special:AbuseFilter/3 while logged" [puppet] - 10https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) (owner: 10Btullis) [12:16:21] (03CR) 10Dreamy Jazz: [C:04-1] [wikireplicas] Redact the abuse_filter_action table with a custom view (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) (owner: 10Btullis) [12:20:30] (03PS2) 10Btullis: [wikireplicas] Redact the abuse_filter_action table with a custom view [puppet] - 10https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) [12:20:59] (03CR) 10Btullis: [wikireplicas] Redact the abuse_filter_action table with a custom view (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088550 (https://phabricator.wikimedia.org/T378671) (owner: 10Btullis) [12:21:54] (03PS1) 10Hnowlan: jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 [12:22:13] (03CR) 10Arturo Borrero Gonzalez: prometheus-node-kernel-panic: scan last 60d worth of 
messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088539 (owner: 10Arturo Borrero Gonzalez) [12:23:47] (03CR) 10Kamila Součková: [C:03+1] jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 (owner: 10Hnowlan) [12:24:00] (03CR) 10Clément Goubert: [C:03+1] jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 (owner: 10Hnowlan) [12:24:20] (03CR) 10Alexandros Kosiaris: [C:03+1] jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 (owner: 10Hnowlan) [12:26:01] (03CR) 10Hnowlan: [C:03+2] jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 (owner: 10Hnowlan) [12:27:04] (03Merged) 10jenkins-bot: jobqueue: restore webVideoTranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088553 (owner: 10Hnowlan) [12:28:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:29:42] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:30:02] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:30:13] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:30:54] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:32:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:43:24] (03PS1) 10Muehlenhoff: zuul-merger: Add support for configuring nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1088558 [12:44:00] (03CR) 10CI reject: [V:04-1] zuul-merger: Add support for configuring nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [12:49:19] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] 
"LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1087951 (owner: 10Majavah) [13:01:26] (03PS2) 10Muehlenhoff: zuul-merger: Add support for configuring nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1088558 [13:03:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [13:09:40] (03PS3) 10Muehlenhoff: zuul-merger: Add support for configuring nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1088558 [13:17:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [13:21:18] (03CR) 10Stevemunene: [C:03+2] airflow: add wmde namespace [puppet] - 10https://gerrit.wikimedia.org/r/1088404 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [13:21:49] (03PS2) 10Stevemunene: airflow: add airflow-wmde files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) [13:22:35] (03PS2) 10Btullis: Enable the performace CPU governor on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) [13:24:13] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: 10Btullis) [13:24:34] (03CR) 10Brouberol: [C:03+2] airflow: create the kerberos token PVC even if kerberos is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087459 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [13:27:31] (03PS1) 10Brouberol: airflow: upgrade docker image to run airflow with python 3.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088561 (https://phabricator.wikimedia.org/T379266) [13:28:56] (03CR) 10Brouberol: [C:03+2] airflow: upgrade docker image to run 
airflow with python 3.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088561 (https://phabricator.wikimedia.org/T379266) (owner: 10Brouberol) [13:29:35] (03CR) 10Ayounsi: "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1088537 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [13:29:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:30:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:31:01] (03CR) 10Marostegui: "I second this idea, that's probably a good way to unify things." [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [13:31:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:31:28] (03CR) 10Ayounsi: [C:03+1] Move idle-timeout under login to the dedicated login template [homer/public] - 10https://gerrit.wikimedia.org/r/1088535 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [13:39:03] (03PS1) 10Muehlenhoff: Switch zuul-merger to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1088562 [13:40:25] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: add deprecation messsage to ores-legacy ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088505 (owner: 10Ilias Sarantopoulos) [13:41:25] (03Merged) 10jenkins-bot: ml-services: add deprecation messsage to ores-legacy ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088505 (owner: 10Ilias Sarantopoulos) [13:44:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10303714 (10cmooney) @Jgreen @Dwisehaupt I was doing some prep work on T377996 - looking at step 1 to import the existing data... 
[13:46:58] (03PS1) 10Brouberol: airflow: define the kerberos keytab secret at all times [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088564 [13:50:13] (03CR) 10Brouberol: [C:03+2] airflow: define the kerberos keytab secret at all times [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088564 (owner: 10Brouberol) [14:02:52] (03PS1) 10Brouberol: airflow: introduce analytics-test specific overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) [14:03:22] (03PS2) 10Brouberol: airflow: introduce analytics-test specific overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) [14:04:54] (03PS3) 10Brouberol: airflow: introduce analytics-test specific overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) [14:05:22] (03PS4) 10Brouberol: airflow: introduce analytics-test specific overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) [14:05:56] (03CR) 10Ayounsi: "lgtm! To be deployed to Netbox-next first seeing how many files are being touched." [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [14:06:00] (03CR) 10Ayounsi: [C:03+1] P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [14:47:35] (03CR) 10JHathaway: [C:03+1] P:netbox remove CAS authentication leftovers. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [14:52:57] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [14:55:32] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:56:08] (03CR) 10FNegri: prometheus-node-kernel-panic: scan last 60d worth of messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088539 (owner: 10Arturo Borrero Gonzalez) [14:56:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [14:56:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2128.codfw.wmnet with reason: host reimage [14:56:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2129.codfw.wmnet with reason: host reimage [14:56:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2137.codfw.wmnet with reason: host reimage [14:57:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [14:57:18] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2138.codfw.wmnet with reason: host reimage [15:00:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2128.codfw.wmnet with reason: host reimage [15:00:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2140.codfw.wmnet with reason: host reimage [15:00:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2141.codfw.wmnet with reason: host reimage 
[15:01:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage [15:01:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2142.codfw.wmnet with reason: host reimage [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2137.codfw.wmnet with reason: host reimage [15:05:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:06:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2138.codfw.wmnet with reason: host reimage [15:07:58] (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [15:08:15] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [15:09:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2140.codfw.wmnet with reason: host reimage [15:13:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2129.codfw.wmnet with reason: host reimage [15:15:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10304205 (10cmooney) I'd a chat with @Jgreen on irc about the above and he confirmed all those hosts are decommed. 
We're a lit... [15:15:11] (03CR) 10Brouberol: [C:03+2] airflow: introduce analytics-test specific overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088566 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [15:15:45] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [15:15:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [15:16:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [15:16:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2087.codfw.wmnet with OS bullseye [15:16:43] (03CR) 10Brouberol: airflow: add airflow-wmde files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088405 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [15:18:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:19:17] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [15:20:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2141.codfw.wmnet with reason: host reimage [15:20:36] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10304224 (10herron) >>! In T378989#10301395, @herron wrote: > ` > ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet > @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >... 
[15:21:04] (03PS1) 10Herron: reimage: don't check is_uefi on VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) [15:21:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:21:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2128.codfw.wmnet with OS bookworm [15:21:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10304255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2128.codfw.wmnet with OS bookworm completed: - wi... [15:22:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:22:18] (03PS1) 10Brouberol: airflow-analytics-test: fix typo in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088584 (https://phabricator.wikimedia.org/T379363) [15:23:31] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: fix typo in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088584 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [15:23:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage [15:24:01] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: fix typo in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088584 (https://phabricator.wikimedia.org/T379363) (owner: 10Brouberol) [15:24:40] (03PS1) 10FNegri: team-wmcs: aggregate kernel alerts over 24h [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) [15:25:55] !log 
jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2136.codfw.wmnet with OS bookworm [15:26:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2136.codfw.wmnet with OS bo... [15:26:13] (03CR) 10CI reject: [V:04-1] team-wmcs: aggregate kernel alerts over 24h [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [15:27:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2142.codfw.wmnet with reason: host reimage [15:27:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:27:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2137.codfw.wmnet with OS bookworm [15:27:27] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:27:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2137.codfw.wmnet with OS bo... 
[15:28:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:28:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:28:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2138.codfw.wmnet with OS bookworm [15:28:14] (03CR) 10Elukey: "Thanks a lot and sorry for the issue, my bad!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:28:29] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2138.codfw.wmnet with OS bo... [15:28:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:28:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:28:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2140.codfw.wmnet with OS bo... 
[15:30:08] (03PS2) 10Herron: reimage: don't check is_uefi on VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) [15:31:04] (03PS2) 10FNegri: team-wmcs: aggregate kernel alerts over 24h [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) [15:31:36] (03CR) 10Elukey: [C:03+1] reimage: don't check is_uefi on VMs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:31:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:32:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:32:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2129.codfw.wmnet with OS bookworm [15:32:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10304346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2129.codfw.wmnet with OS bookworm completed: - wi... [15:32:37] (03CR) 10Herron: "no worries! 
thanks for the super quick review" [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:39:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:39:19] (03CR) 10Herron: [C:03+2] reimage: don't check is_uefi on VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:40:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:40:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2141.codfw.wmnet with OS bookworm [15:40:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2141.codfw.wmnet with OS bo... [15:40:28] (03PS1) 10Seddon: Reviving "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088586 [15:41:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10304364 (10Jhancock.wm) [15:42:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10304371 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert This set is done! 
[15:43:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:44:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304385 (10Jhancock.wm) [15:45:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:45:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2143.codfw.wmnet with OS bookworm [15:45:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2143.codfw.wmnet with OS bo... 
[15:45:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:45:37] (03Merged) 10jenkins-bot: reimage: don't check is_uefi on VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1088583 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:46:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:46:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2142.codfw.wmnet with OS bookworm [15:46:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2142.codfw.wmnet with OS bo... [15:47:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304407 (10Jhancock.wm) [15:47:06] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10304409 (10Clement_Goubert) Thanks! 
[15:48:38] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [15:52:18] (03Abandoned) 10Elukey: [TEST] sre.hosts.reimage: enable/disable PXE over HTTP for UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1088524 (owner: 10Elukey) [15:53:47] (03PS1) 10Elukey: TEST: sre.hosts.reimage: use UEFIBootNext for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088590 [15:55:08] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [15:55:25] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [16:02:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2139.codfw.wmnet with OS bookworm [16:02:33] (03CR) 10Dzahn: [C:03+1] zuul-merger: Add support for configuring nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [16:02:34] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1004.eqiad.wmnet with reason: host reimage [16:02:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2139.codfw.wmnet with OS bo... 
[16:05:21] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1004.eqiad.wmnet with reason: host reimage [16:09:06] (03PS3) 10Seddon: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [16:10:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:10:55] (03CR) 10CI reject: [V:04-1] Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [16:13:19] (03PS4) 10Seddon: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [16:16:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2136.codfw.wmnet with OS bookworm [16:17:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304498 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2136.codfw.wmnet with O... 
[16:17:53] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1088558/4481/" [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [16:20:21] (03CR) 10Xcollazo: [C:03+1] Deprecate system::role for dumps roles [puppet] - 10https://gerrit.wikimedia.org/r/1088526 (owner: 10Muehlenhoff) [16:22:42] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [16:24:15] (03PS2) 10Elukey: TEST: sre.hosts.reimage: improve UEFI for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088590 [16:24:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop confirmed on contint*" [puppet] - 10https://gerrit.wikimedia.org/r/1088558 (owner: 10Muehlenhoff) [16:25:00] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:29:58] (03CR) 10Zabe: Reopen testcommonswiki for testing Chart extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [16:31:07] (03PS5) 10Seddon: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [16:33:00] (03CR) 10Dzahn: [V:03+1 C:03+1] "lgtm! 
https://puppet-compiler.wmflabs.org/output/1088562/4482/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1088562 (owner: 10Muehlenhoff) [16:35:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [16:35:39] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:39:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [16:42:21] (03CR) 10Dzahn: [V:03+1 C:03+2] Switch zuul-merger to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1088562 (owner: 10Muehlenhoff) [16:43:05] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:46:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "same rules, just /etc/ferm/conf.d/10_git-daemon_internal was removed, /etc/ferm/conf.d/10_git-daemon_internal_hosts and /etc/ferm/conf.d/1" [puppet] - 10https://gerrit.wikimedia.org/r/1088562 (owner: 10Muehlenhoff) [16:49:03] (03PS1) 10Brouberol: analytics_test_cluster: enable egress from the dse kubepods network [puppet] - 10https://gerrit.wikimedia.org/r/1088596 (https://phabricator.wikimedia.org/T377602) [16:49:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2144.codfw.wmnet with OS bookworm [16:49:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2145.codfw.wmnet with OS bookworm [16:49:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2156.codfw.wmnet with OS bookworm [16:49:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2157.codfw.wmnet with OS bookworm [16:49:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2158.codfw.wmnet 
with OS bookworm [16:49:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2159.codfw.wmnet with OS bookworm [16:49:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2160.codfw.wmnet with OS bookworm [16:49:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2161.codfw.wmnet with OS bookworm [16:49:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2162.codfw.wmnet with OS bookworm [16:49:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2145.codfw.wmnet with O... [16:49:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304676 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2144.codfw.wmnet with O... 
[16:49:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304679 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2156.codfw.wmnet with OS bookworm [16:49:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2157.codfw.wmnet with OS bookworm [16:49:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304681 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2158.codfw.wmnet with OS bookworm [16:49:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2159.codfw.wmnet with OS bookworm [16:50:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2160.codfw.wmnet with OS bookworm [16:50:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2161.codfw.wmnet with OS bookworm [16:50:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304687 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2162.codfw.wmnet with OS bookworm [16:50:33] (03CR) 10Dzahn: [C:03+2] zuul-merger: Remove now obsolete variable [puppet] - 10https://gerrit.wikimedia.org/r/1088572 (owner: 10Muehlenhoff) [16:50:37] (03PS2) 10Muehlenhoff: zuul-merger: Remove now obsolete variable [puppet] - 10https://gerrit.wikimedia.org/r/1088572 [16:52:41] (03PS2) 10Brouberol: analytics_test_cluster: enable egress from the dse kubepods network [puppet] - 10https://gerrit.wikimedia.org/r/1088596 (https://phabricator.wikimedia.org/T377602) [16:54:04] (03CR) 10Btullis: [C:03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/1088596 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [16:55:12] (03PS1) 10Xcollazo: Move start day of dump_fillin_wd job from the 7th to the 10th of the month [puppet] - 10https://gerrit.wikimedia.org/r/1088599 (https://phabricator.wikimedia.org/T379393) [16:55:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [16:56:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [16:56:04] (03CR) 10Brouberol: [C:03+2] analytics_test_cluster: enable egress from the dse kubepods network [puppet] - 10https://gerrit.wikimedia.org/r/1088596 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [16:58:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:58:25] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [17:00:02] (03PS1) 
10Brouberol: fix typo in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1088600 (https://phabricator.wikimedia.org/T377602) [17:00:41] (03CR) 10Btullis: [C:03+1] fix typo in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1088600 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [17:00:48] (03CR) 10Brouberol: [C:03+2] fix typo in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1088600 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [17:01:29] (03CR) 10Dzahn: [C:03+1] fix typo in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1088600 (https://phabricator.wikimedia.org/T377602) (owner: 10Brouberol) [17:04:05] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1088572/4483/contint2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1088572 (owner: 10Muehlenhoff) [17:04:24] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1088572 (owner: 10Muehlenhoff) [17:05:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:05:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2136.codfw.wmnet with OS bookworm [17:05:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2136.codfw.wmnet with OS bo... 
[17:05:35] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2082.codfw.wmnet with OS bookworm [17:05:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm execut... [17:07:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2163.codfw.wmnet with OS bookworm [17:07:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2159.codfw.wmnet with reason: host reimage [17:07:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2156.codfw.wmnet with reason: host reimage [17:07:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2162.codfw.wmnet with reason: host reimage [17:07:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2160.codfw.wmnet with reason: host reimage [17:07:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2161.codfw.wmnet with reason: host reimage [17:08:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2157.codfw.wmnet with reason: host reimage [17:08:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage [17:08:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage [17:08:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304737 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host 
wikikube-worker2163.codfw.wmnet with OS bookworm [17:08:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2158.codfw.wmnet with reason: host reimage [17:09:12] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1005.eqiad.wmnet [17:09:18] !log herron@cumin1002 START - Cookbook sre.dns.netbox [17:10:35] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [17:10:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304739 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [17:11:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2159.codfw.wmnet with reason: host reimage [17:12:03] (03PS1) 10FNegri: prometheus-node-kernel-panic: refactor and improve [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) [17:13:46] (03PS2) 10FNegri: prometheus-node-kernel-panic: refactor and improve [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) [17:13:50] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1005.eqiad.wmnet - herron@cumin1002" [17:13:54] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1005.eqiad.wmnet - herron@cumin1002" [17:13:54] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:54] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1005.eqiad.wmnet on all recursors [17:13:58] !log 
herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1005.eqiad.wmnet on all recursors [17:14:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2161.codfw.wmnet with reason: host reimage [17:14:25] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1005.eqiad.wmnet - herron@cumin1002" [17:14:29] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1005.eqiad.wmnet - herron@cumin1002" [17:15:17] (03PS3) 10FNegri: prometheus-node-kernel-panic: refactor and improve [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) [17:15:38] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1005.eqiad.wmnet with OS bookworm [17:15:48] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10304750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker1005.eqiad.wmnet with OS bookworm [17:17:00] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2082.codfw.wmnet with OS bullseye [17:17:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304752 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye execut... 
[17:17:34] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [17:17:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2156.codfw.wmnet with reason: host reimage [17:17:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [17:20:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2162.codfw.wmnet with reason: host reimage [17:21:48] PROBLEM - MariaDB Replica SQL: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table frwiki.geo_tags: Index for table geo_tags is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1155-bin.003753, end_log_pos 323829635 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_r [17:22:09] jouncebot: nowandnext [17:22:09] For the next 14 hour(s) and 37 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241108T0800) [17:22:09] In 14 hour(s) and 37 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241109T0800) [17:23:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2158.codfw.wmnet with reason: host reimage [17:23:48] I'm going to bend rules and deploy some low-risk changes that only affect testwiki and testcommonswiki [17:25:19] I am going to help analytics people and fix the an-redacteddb1001 issue [17:26:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2160.codfw.wmnet with reason: host reimage [17:27:12] !log rebuild frwiki.geo_tags @ an-redacteddb1001 [17:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:48] RECOVERY - MariaDB Replica SQL: s6 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:27:59] ^ fixed [17:29:32] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [17:30:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:31:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2157.codfw.wmnet with reason: host reimage [17:32:47] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1005.eqiad.wmnet with reason: host reimage [17:32:52] (03PS4) 10FNegri: prometheus-node-kernel-panic: ignore false warnings [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) [17:34:05] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [17:35:18] !log jhancock@cumin2002 START - Cookbook
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:36:52] (03PS5) 10FNegri: prometheus-node-kernel-panic: ignore false warnings [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) [17:36:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:36:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2159.codfw.wmnet with OS bookworm [17:37:06] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2159.codfw.wmnet with OS bookworm completed: - wi... [17:37:34] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1005.eqiad.wmnet with reason: host reimage [17:37:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2164.codfw.wmnet with OS bookworm [17:37:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage [17:37:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2164.codfw.wmnet with OS bookworm [17:38:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:38:22] !log jhancock@cumin2002 END (PASS) 
- Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2161.codfw.wmnet with OS bookworm [17:38:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2161.codfw.wmnet with OS bookworm completed: - wi... [17:38:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:39:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2165.codfw.wmnet with OS bookworm [17:39:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2165.codfw.wmnet with OS bookworm [17:39:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:39:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2156.codfw.wmnet with OS bookworm [17:39:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2156.codfw.wmnet with OS bookworm completed: - wi... 
[17:40:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:40:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage [17:40:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2166.codfw.wmnet with OS bookworm [17:40:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304808 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2166.codfw.wmnet with OS bookworm [17:42:21] (03PS6) 10Aude: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) [17:42:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:42:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2162.codfw.wmnet with OS bookworm [17:42:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:42:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2162.codfw.wmnet with OS bookworm completed: - wi... 
[17:43:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2167.codfw.wmnet with OS bookworm [17:43:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2167.codfw.wmnet with OS bookworm [17:44:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:44:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2158.codfw.wmnet with OS bookworm [17:44:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2158.codfw.wmnet with OS bookworm completed: - wi... 
[17:44:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2168.codfw.wmnet with OS bookworm [17:45:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304821 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2168.codfw.wmnet with OS bookworm [17:45:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:46:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:46:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2160.codfw.wmnet with OS bookworm [17:47:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2160.codfw.wmnet with OS bookworm completed: - wi... 
[17:47:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2169.codfw.wmnet with OS bookworm [17:48:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2169.codfw.wmnet with OS bookworm [17:48:09] (03PS1) 10Cathal Mooney: Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) [17:49:22] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:49:24] (03CR) 10CI reject: [V:04-1] Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [17:49:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:50:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:50:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2157.codfw.wmnet with OS bookworm [17:50:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2157.codfw.wmnet with OS bookworm completed: - wi... 
[17:50:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2170.codfw.wmnet with OS bookworm [17:51:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2170.codfw.wmnet with OS bookworm [17:52:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2163.codfw.wmnet with OS bookworm [17:52:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10304870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm executed with e... [17:54:33] (03PS1) 10Brouberol: Fix typos in analytics-hadoop-test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088606 (https://phabricator.wikimedia.org/T379363) [17:54:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:55:14] (03PS2) 10Brouberol: airflow: release airflow 2.10.3 on our test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088575 (https://phabricator.wikimedia.org/T379136) [17:55:48] (03CR) 10Btullis: [C:03+1] airflow: release airflow 2.10.3 on our test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088575 (https://phabricator.wikimedia.org/T379136) (owner: 10Brouberol) [17:56:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2164.codfw.wmnet with reason: host reimage [17:56:03] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1005.eqiad.wmnet with OS bookworm [17:56:03] !log 
herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1005.eqiad.wmnet [17:56:12] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10304882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker1005.eqiad.wmnet with OS bookworm completed: - aux-k8s-worker1005 (**PASS**)... [17:56:19] (03PS2) 10Cathal Mooney: Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) [17:56:38] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [17:56:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2163.codfw.wmnet with OS bookworm [17:56:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:56:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10304883 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... 
[17:56:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2144.codfw.wmnet with OS bookworm [17:56:58] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Create new snippets for frack IPs - cmooney@cumin1002" [17:57:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Create new snippets for frack IPs - cmooney@cumin1002" [17:57:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2165.codfw.wmnet with reason: host reimage [17:57:36] (03CR) 10CI reject: [V:04-1] Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [17:59:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2166.codfw.wmnet with reason: host reimage [17:59:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:59:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:59:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2145.codfw.wmnet with OS bookworm [18:00:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host 
wikikube-worker2145.codfw.wmnet with OS bo... [18:01:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2164.codfw.wmnet with reason: host reimage [18:01:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2167.codfw.wmnet with reason: host reimage [18:02:12] (03PS7) 10Aude: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) [18:03:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304898 (10Jhancock.wm) [18:03:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2168.codfw.wmnet with reason: host reimage [18:03:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10304899 (10Jhancock.wm) [18:04:19] (03CR) 10Aude: [C:03+1] Reviving "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088586 (owner: 10Seddon) [18:04:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2165.codfw.wmnet with reason: host reimage [18:06:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2169.codfw.wmnet with reason: host reimage [18:07:51] (03CR) 10Ssingh: Remove manual A and PTR records for frack and add Netbox includes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:07:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2166.codfw.wmnet with reason: host reimage [18:10:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker2170.codfw.wmnet with reason: host reimage [18:10:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2169.codfw.wmnet with reason: host reimage [18:12:07] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2170.codfw.wmnet with reason: host reimage [18:17:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2167.codfw.wmnet with reason: host reimage [18:17:34] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:19:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:19:34] (03PS1) 10Aleksandar Mastilovic: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 [18:20:29] (03CR) 10CI reject: [V:04-1] Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (owner: 10Aleksandar Mastilovic) [18:20:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2168.codfw.wmnet with reason: host reimage [18:21:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:21:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2164.codfw.wmnet with OS bookworm [18:21:20] 10ops-codfw, 
06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2164.codfw.wmnet with OS bookworm completed: - wi... [18:21:23] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Create new snippets for frack IPs - cmooney@cumin1002" [18:21:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Create new snippets for frack IPs - cmooney@cumin1002" [18:21:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:26:10] (03PS3) 10Cathal Mooney: Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) [18:26:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:26:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2165.codfw.wmnet with OS bookworm [18:26:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2165.codfw.wmnet with OS bookworm completed: - wi... 
[18:27:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:27:19] (03CR) 10CI reject: [V:04-1] Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:27:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:27:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2166.codfw.wmnet with OS bookworm [18:27:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2166.codfw.wmnet with OS bookworm completed: - wi... 
[18:27:45] (03CR) 10Cathal Mooney: Remove manual A and PTR records for frack and add Netbox includes (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:29:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:31:07] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:31:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:31:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2169.codfw.wmnet with OS bookworm [18:31:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2169.codfw.wmnet with OS bookworm completed: - wi... 
[18:32:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:33:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:33:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2170.codfw.wmnet with OS bookworm [18:33:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305060 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2170.codfw.wmnet with OS bookworm completed: - wi... [18:36:35] (03PS4) 10Cathal Mooney: Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) [18:37:06] (03PS1) 10Ilias Sarantopoulos: ml-services: update aya model deployment to aya-expanse-8b [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088609 (https://phabricator.wikimedia.org/T379052) [18:37:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:38:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:38:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2167.codfw.wmnet with OS bookworm [18:38:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305065 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2167.codfw.wmnet with OS bookworm completed: - wi... [18:39:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10305066 (10Khantstop) @MatthewVernon are you able to add me to sql_lab role as well? I don't think this was... [18:39:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:40:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2163.codfw.wmnet with OS bookworm [18:40:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm executed with e... [18:40:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:40:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2168.codfw.wmnet with OS bookworm [18:40:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10305073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2168.codfw.wmnet with OS bookworm completed: - wi... 
[18:44:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:45:03] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1088610 [18:45:31] (03CR) 10Ssingh: [C:03+1] "Looks good. I would say let's merge on Monday unless there is a pressing need for today? If today, we should double-check to make sure thi" [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:46:47] (03CR) 10Ssingh: [C:03+1] "Let's merge after your post-script checks." [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:49:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:09] (03PS1) 10Ssingh: wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - 10https://gerrit.wikimedia.org/r/1088612 [18:56:18] (03CR) 10Ssingh: "Observed with Cathal in Iac1702f341fac35fe93f69cbc0e3f736e2ebffd8. We should remove these if they are no longer required. 
Needs fr-tech ap" [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [18:56:35] (03CR) 10Cathal Mooney: [C:03+2] Remove manual A and PTR records for frack and add Netbox includes [dns] - 10https://gerrit.wikimedia.org/r/1088605 (https://phabricator.wikimedia.org/T377996) (owner: 10Cathal Mooney) [18:57:12] (03PS2) 10Ssingh: wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - 10https://gerrit.wikimedia.org/r/1088612 [18:59:15] (03PS1) 10Andrea Denisse: grafana: Fix login redirection to preserve dashboard context [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) [19:03:59] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for khantstop - https://phabricator.wikimedia.org/T379409 (10Khantstop) 03NEW [19:05:00] (03PS1) 10Dzahn: gerrit: set gerrit site dir Hiera value for new machine gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1088613 (https://phabricator.wikimedia.org/T338470) [19:15:49] (03CR) 10Dzahn: [C:03+1] "only affecting host not yet in production" [puppet] - 10https://gerrit.wikimedia.org/r/1088613 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:20:34] (03CR) 10Herron: "Thanks! I copied this to grafana-next for a quick test and it seems better but I didn't get it fully redirecting to the idp login yet. 
I" [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [19:25:00] (03CR) 10Andrea Denisse: "The plan is to test this in the grafana-next hosts and to test this with HTTPB" [puppet] - 10https://gerrit.wikimedia.org/r/1088611 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [19:29:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:33:37] (03CR) 10Aude: Reopen testcommonswiki for testing Chart extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:34:08] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [19:34:55] PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:50] mutante: denisse: aude and I are going to bend rules a little bit and deploy some low-risk changes that only affect testwiki and testcommonswiki [19:36:08] cdanis: ACK, thanks! 
[19:36:15] RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [19:36:27] cdanis: I think "test wiki" only is not that bad :) [19:37:23] RECOVERY - Disk space on thanos-be1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [19:39:28] (03CR) 10CDanis: [C:03+1] Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:40:09] (03PS1) 10Andrea Denisse: grafana: Allow HTTP access from the deployment-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1088616 (https://phabricator.wikimedia.org/T379043) [19:40:21] i'm gonna check why ms-be2083 is down. probably disk failure again [19:40:48] well, uptime 5 min, heh [19:40:51] RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops [19:42:14] (03PS2) 10Bking: Adding a Helm chart for HDFS Synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077106 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [19:42:19] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [19:42:44] (03CR) 10Dzahn: [C:03+1] "yea, we do this for a bunch of services where we want to be able to run httpb tests against backends. 
it's either deployment or cumin host" [puppet] - 10https://gerrit.wikimedia.org/r/1088616 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [19:43:00] (03CR) 10Bking: [C:03+2] Adding a Helm chart for HDFS Synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077106 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [19:43:52] (03Merged) 10jenkins-bot: Adding a Helm chart for HDFS Synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077106 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [19:43:53] (03CR) 10Andrea Denisse: [C:03+2] grafana: Allow HTTP access from the deployment-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1088616 (https://phabricator.wikimedia.org/T379043) (owner: 10Andrea Denisse) [19:44:11] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [19:44:46] mutante: that was my fault sorry [19:44:53] should have downtimed it first [19:45:01] jhathaway: ACK! 
thanks :) [19:46:58] (03PS2) 10Aleksandar Mastilovic: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 [19:47:02] (03Merged) 10jenkins-bot: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:47:45] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1088366|Reopen testcommonswiki for testing Chart extension]] [19:50:27] !log aude@deploy2002 aude: Backport for [[gerrit:1088366|Reopen testcommonswiki for testing Chart extension]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:54:05] RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [19:54:05] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [19:57:36] !log aude@deploy2002 aude: Continuing with sync [19:59:01] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [19:59:27] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [19:59:30] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [19:59:41] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [20:02:19] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088366|Reopen testcommonswiki for 
testing Chart extension]] (duration: 14m 33s) [20:03:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:04:05] (03Merged) 10jenkins-bot: Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:04:22] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1088375|Enable Tabular data for test commons (T378127)]] [20:04:25] T378127: Enable Chart extension on testwiki and testcommons - https://phabricator.wikimedia.org/T378127 [20:06:52] !log aude@deploy2002 aude: Backport for [[gerrit:1088375|Enable Tabular data for test commons (T378127)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:40] !log aude@deploy2002 aude: Continuing with sync [20:15:17] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088375|Enable Tabular data for test commons (T378127)]] (duration: 10m 55s) [20:15:27] T378127: Enable Chart extension on testwiki and testcommons - https://phabricator.wikimedia.org/T378127 [20:17:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088586 (owner: 10Seddon) [20:18:03] (03Merged) 10jenkins-bot: Reviving "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088586 (owner: 10Seddon) [20:18:21] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1088586|Reviving "Update interwiki map"]] [20:20:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [20:20:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305421 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [20:21:53] !log aude@deploy2002 seddon, aude: Backport for [[gerrit:1088586|Reviving "Update interwiki map"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:24:02] !log aude@deploy2002 seddon, aude: Continuing with sync [20:28:41] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088586|Reviving "Update interwiki map"]] (duration: 10m 19s) [20:35:19] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [20:39:31] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [20:39:34] (03PS1) 10Varnent: Update Wikimedia Foundation primary address. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417) [20:46:12] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417) (owner: 10Varnent) [20:52:53] (03CR) 10Dzahn: [C:03+1] "thanks for doing this. 
confirmed it matches https://donate.wikimedia.org/wiki/Tax_deductibility" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088628 (https://phabricator.wikimedia.org/T379417) (owner: 10Varnent) [20:55:32] (03CR) 10Dzahn: [C:03+2] gerrit: set gerrit site dir Hiera value for new machine gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1088613 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [21:01:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:02:02] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:28] !log cumin2002 - sudo systemctl status httpbb_kubernetes_mw-api-int_hourly [21:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:02] cdanis: the test that failed there was that it got a 503 for a single specific URL on it.wikipedia.org, all others worked. [21:06:16] then when I ran the test manually it already all worked again [21:06:27] mutante: interesting, we had a single 503 for a specific url on meta, before [21:06:34] but it was like very temp [21:06:43] now it's all clear [21:06:45] 20:20:19 Check 'check_testservers_k8s-1_of_1' failed: Sending to mwdebug.discovery.wmnet... [21:06:47] https://www.mediawiki.org/FAQ (/srv/deployment/httpbb-tests/appserver/test_redirects.yaml:30) [21:06:49] Status code: expected 301, got 503. [21:06:51] Location header: expected 'https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ', was missing. [21:06:53] === [21:06:55] FAIL: 131 requests sent to mwdebug.discovery.wmnet. 1 request with failed assertions. 
[21:06:57] yeah [21:06:57] https://www.wikipedia.org/wiki/it:Saturno_(astronomia)?a=test [21:06:59] it was this one [21:06:59] and also worked on a retry [21:07:02] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:07:15] hmm [21:07:49] ● httpbb_hourly_appserver.timer not-found failed failed httpbb_hourly_appserver.timer [21:07:52] eh..what [21:08:08] must be the bare metal version that is not-found [21:08:36] !log cumint2002 [cumin2002:~] $ sudo systemctl reset-failed [21:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:47] running puppet to see if that comes back [21:11:06] yea, I can't reproduce either of that now [21:11:11] no failed units of any kind [21:11:14] well, sounds good to me [21:11:18] good, same [21:11:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:12:06] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2082.codfw.wmnet with OS bullseye [21:12:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305583 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye execut... 
[21:17:59] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:18:07] !log disabling Puppet on grafana2001 - T379043 [21:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:10] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:18:19] T379043: Login through Grafana using the login link do not work - https://phabricator.wikimedia.org/T379043 [21:21:38] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10305613 (10Ferien) Please note I have filed the task T378406 - summary: channels ke... [21:23:07] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:23:45] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:28:45] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:29:07] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:29:36] ^ this is testing work [21:37:27] PROBLEM - rt.wikimedia.org requires 
authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:38:05] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:38:17] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:39:55] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:08:26] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [22:08:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305771 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [22:28:54] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2082.codfw.wmnet with OS bullseye [22:29:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye execut... 
[22:29:20] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [22:29:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [22:33:42] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088653 (https://phabricator.wikimedia.org/T219903) [22:36:23] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088653 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [22:37:45] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088653 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [22:38:30] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:38:48] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:38:50] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:39:13] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:39:14] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:39:32] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:41:13] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [22:41:34] (03PS1) 10Reedy: CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 [22:44:03] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with 
reason: host reimage [22:48:40] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088658 (https://phabricator.wikimedia.org/T219903) [22:48:45] (03CR) 10Dwisehaupt: [C:03+1] "We are no longer using pay-lvs1001 and pay-lvs1002 (moved on to 1003/1004) but also I've never used a public IP to access or address them " [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [22:50:10] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088658 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [22:50:33] (03CR) 10DDesouza: [V:03+2 C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088658 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [22:51:06] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:51:23] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:51:24] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:51:45] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:51:46] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:52:02] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:54:04] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:54:07] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:54:09] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:54:11] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:54:13] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:54:15] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [23:07:41] !log jhathaway@cumin2002 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [23:07:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10305908 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... [23:16:05] !log ran `delete from oathauth_devices where oad_id=4506;` on centralauth for T379398 because oad_user=0 [23:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:09] T379398: Wikitech users being unexpectedly prompted for 2FA tokens - https://phabricator.wikimedia.org/T379398 [23:23:13] (03CR) 10Zabe: [C:03+1] CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 (owner: 10Reedy) [23:35:26] !log attach Sotiale's local accounts on newly created wikis [23:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log