[00:06:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1064133 (owner: TrainBranchBot)
[00:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:02] (CR) Cwhite: [C:+1] alert: Add the alert[12]002 hosts to Prometheus blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/1064097 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[00:09:47] (CR) Ssingh: [C:+1] varnish: Remove unused browser security checks [puppet] - https://gerrit.wikimedia.org/r/1064125 (https://phabricator.wikimedia.org/T370200) (owner: BCornwall)
[00:10:41] (CR) Cwhite: [C:+1] alert: Add the alert[12]002 hosts to acme chief [puppet] - https://gerrit.wikimedia.org/r/1064107 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[00:11:48] (CR) Cwhite: [C:+1] alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: Andrea Denisse)
[00:13:11] (CR) Dzahn: [C:+2] gitlab: add replica hosts to Icinga [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[00:24:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:26:45] (CR) Dzahn: [C:+2] "uhm.. this isn't working. This needs to be applied on the icinga machine. 
So have to move this to class icinga::monitor::gitlab where we ha" [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[00:30:31] (PS1) Dzahn: icinga: add gitlab-replica service names as monitored hosts [puppet] - https://gerrit.wikimedia.org/r/1064136 (https://phabricator.wikimedia.org/T363564)
[00:31:58] (CR) Dzahn: [C:+2] "This has to be in the icinga role, not the gitlab role, to work. Just adding next to existing gitlab.wikimedia.org special host. Not addi" [puppet] - https://gerrit.wikimedia.org/r/1064136 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[00:36:16] (PS1) Dzahn: Revert "gitlab: add replica hosts to Icinga" [puppet] - https://gerrit.wikimedia.org/r/1064137
[00:37:03] (CR) Dzahn: [C:+2] "replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064136" [puppet] - https://gerrit.wikimedia.org/r/1064136 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[00:37:23] (CR) Dzahn: [C:+2] "see https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gitlab-replica now" [puppet] - https://gerrit.wikimedia.org/r/1064136 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[00:38:01] (CR) Dzahn: "I added the replica hosts to Icinga. 
see:" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto)
[00:40:04] (CR) Dzahn: [C:+2] Revert "gitlab: add replica hosts to Icinga" [puppet] - https://gerrit.wikimedia.org/r/1064137 (owner: Dzahn)
[01:04:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:33:40] SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10079939 (Krinkle)
[01:38:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:42:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[01:43:45] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:53:45] RESOLVED: [3x] Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:57:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:20:04] SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10079968 (CDanis)
[02:36:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:26] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:31:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:01:14] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10080000 (Marostegui) This has caused another page in production T372961
[04:18:52] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10080020 (Marostegui)
[04:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:34:26] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:26] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67406 and previous config saved to /var/cache/conftool/dbconfig/20240821-054057-arnaudb.json
[05:50:47] (CR) Arnaudb: "this is for a quick test of the new cookbook logic brought by @rcoccioli@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: Arnaudb)
[05:51:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:56:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 15%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67407 and previous config saved to /var/cache/conftool/dbconfig/20240821-055602-arnaudb.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T0600)
[06:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67408 and previous config saved to /var/cache/conftool/dbconfig/20240821-061108-arnaudb.json
[06:26:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67409 and 
previous config saved to /var/cache/conftool/dbconfig/20240821-062613-arnaudb.json
[06:41:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67410 and previous config saved to /var/cache/conftool/dbconfig/20240821-064119-arnaudb.json
[06:56:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: post backup w/o prefetch repooling', diff saved to https://phabricator.wikimedia.org/P67411 and previous config saved to /var/cache/conftool/dbconfig/20240821-065624-arnaudb.json
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:24] !log remove bgp session to mw2291 on codfw routers (host renumbered)
[07:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:52] (CR) Marostegui: [C:-1] "What makes you guys think the RAID isn't set up?" [puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: Arnaudb)
[07:08:55] (CR) Arnaudb: "Done" [puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: Arnaudb)
[07:09:17] (CR) Marostegui: [C:-1] "I still don't see the logic of wanting this to replicate from s1." 
[puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: Arnaudb)
[07:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:12:12] (CR) Marostegui: [C:-1] "Yes, I'd prefer if the yaml files would be cleaner as in: no new host, a bit of description of why some other hostnames are in there etc." [puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: Arnaudb)
[07:17:01] (PS3) Arnaudb: mariadb: temporary testing environment [puppet] - https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893)
[07:22:30] SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10080170 (Peachey88)
[07:22:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:27:27] !log brouberol@cumin1002 START - Cookbook sre.puppet.renew-cert for wdqs1024.eqiad.wmnet: Renew puppet certificate - brouberol@cumin1002
[07:27:46] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for wdqs1024.eqiad.wmnet: Renew puppet certificate - brouberol@cumin1002
[07:30:12] (CR) Jelto: [C:+2] gitlab: Allow WMCS runners to use registry.cloud.releng.team [puppet] - https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) (owner: BryanDavis)
[07:30:53] (CR) Jelto: [C:+2] "Acknowledged" [puppet] - 
https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) (owner: BryanDavis)
[07:33:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67412 and previous config saved to /var/cache/conftool/dbconfig/20240821-073348-root.json
[07:37:24] (CR) Jelto: [C:+1] "lgtm, but we probably need the additional service names for T372804 again" [puppet] - https://gerrit.wikimedia.org/r/1064091 (owner: Dzahn)
[07:39:54] !log enable cloudsw1-d5-eqiad:xe-0/0/21 (SFP now inserted)
[07:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:37] (CR) Jelto: [C:-1] "Leaving the burst parameter as optional and undefined was intentional. The `throttling.nft.erb` uses `<% if defined?(@allowed_burst_packet" [puppet] - https://gerrit.wikimedia.org/r/1064065 (owner: Dzahn)
[07:42:36] !log rollback JIO_DIRECT from cr2-eqsin AVOID-PATHS
[07:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:41] cc topranks ^
[07:44:46] (PS4) Slyngshede: Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112
[07:46:32] (CR) Jelto: "I don't think the downtime issue in T363564 is related to the replicas not being in icinga. 
The issue happens most of the time on the prod" [puppet] - https://gerrit.wikimedia.org/r/1064136 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[07:46:59] (CR) CI reject: [V:-1] Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112 (owner: Slyngshede)
[07:48:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67413 and previous config saved to /var/cache/conftool/dbconfig/20240821-074854-root.json
[07:48:59] (CR) Jelto: [C:-1] "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064136/2#message-c4fef6f8a04bcd1f9508c24d834aea0729644b88" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto)
[07:55:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:52] (CR) Klausman: [C:+1] httpbb: add post deployment tests for the rec-api endpoint [puppet] - https://gerrit.wikimedia.org/r/1064021 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:00:04] andre and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T0800). 
[08:00:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:04:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67414 and previous config saved to /var/cache/conftool/dbconfig/20240821-080359-root.json
[08:06:28] I will now start promoting group1 wikis to 1.43.0-wmf.19
[08:07:33] (PS1) TrainBranchBot: group1 to 1.43.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1064326 (https://phabricator.wikimedia.org/T366964)
[08:07:35] (CR) TrainBranchBot: [C:+2] group1 to 1.43.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1064326 (https://phabricator.wikimedia.org/T366964) (owner: TrainBranchBot)
[08:08:18] (Merged) jenkins-bot: group1 to 1.43.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1064326 (https://phabricator.wikimedia.org/T366964) (owner: TrainBranchBot)
[08:11:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:36] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.19 refs T366964
[08:15:40] T366964: 1.43.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T366964
[08:19:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P67415 and previous config saved to /var/cache/conftool/dbconfig/20240821-081904-root.json
[08:21:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:27] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:21:47] (PS1) Marostegui: installserver: Do not format db2226 [puppet] - https://gerrit.wikimedia.org/r/1064327
[08:23:13] (PS1) Jelto: gitlab: use burst parameter with nftables_throttling [puppet] - https://gerrit.wikimedia.org/r/1064328 (https://phabricator.wikimedia.org/T366882)
[08:24:01] (PS3) Btullis: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152)
[08:26:21] (PS4) Btullis: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152)
[08:26:27] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:26:31] (CR) Marostegui: [C:+2] installserver: Do not format db2226 [puppet] - https://gerrit.wikimedia.org/r/1064327 (owner: Marostegui)
[08:27:39] (PS1) Filippo Giunchedi: pontoon: fix lb indentation [puppet] - https://gerrit.wikimedia.org/r/1064329
[08:27:39] (PS1) Filippo Giunchedi: pontoon: add o11y-phi stack [puppet] - 
https://gerrit.wikimedia.org/r/1064330
[08:28:10] (CR) CI reject: [V:-1] pontoon: add o11y-phi stack [puppet] - https://gerrit.wikimedia.org/r/1064330 (owner: Filippo Giunchedi)
[08:30:03] (PS5) Btullis: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152)
[08:31:20] (CR) Filippo Giunchedi: [C:+2] pontoon: fix lb indentation [puppet] - https://gerrit.wikimedia.org/r/1064329 (owner: Filippo Giunchedi)
[08:33:18] (PS2) Filippo Giunchedi: pontoon: add o11y-phi stack [puppet] - https://gerrit.wikimedia.org/r/1064330
[08:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67416 and previous config saved to /var/cache/conftool/dbconfig/20240821-083410-root.json
[08:35:07] (CR) Filippo Giunchedi: [C:+2] pontoon: add o11y-phi stack [puppet] - https://gerrit.wikimedia.org/r/1064330 (owner: Filippo Giunchedi)
[08:36:02] (CR) Filippo Giunchedi: [C:+1] "LGTM, we can merge this at any time" [puppet] - https://gerrit.wikimedia.org/r/1064107 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[08:36:28] (PS5) Slyngshede: Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112
[08:36:53] (CR) Filippo Giunchedi: [C:+1] alert: Add the alert[12]002 hosts to Prometheus blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/1064097 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[08:38:19] (CR) Filippo Giunchedi: "LGTM, though it might need rebasing/change re: the acme-chief config which is also changed by Iefa94982e63" [puppet] - https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[08:38:46] (CR) Filippo Giunchedi: [C:+1] "LGTM, this can be merged at any time" [puppet] - 
https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[08:38:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:34] (CR) Filippo Giunchedi: "LGTM, though this patch should also add 1002 to "alertmanagers" variable" [puppet] - https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[08:39:37] (PS6) Btullis: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152)
[08:41:40] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: Btullis)
[08:44:59] (PS1) Filippo Giunchedi: pontoon: remove obsolete observability stack [puppet] - https://gerrit.wikimedia.org/r/1064331
[08:46:49] (CR) Filippo Giunchedi: [C:+2] pontoon: remove obsolete observability stack [puppet] - https://gerrit.wikimedia.org/r/1064331 (owner: Filippo Giunchedi)
[08:47:17] (CR) Ayounsi: Fix puppet import so it doesn't fail if parent prefix has no role (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: Cathal Mooney)
[08:48:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:16] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1184 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67417 and previous config saved to /var/cache/conftool/dbconfig/20240821-084915-root.json
[08:53:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:53] (PS14) AOkoth: sql_exporter: specify column for metric [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822)
[08:58:21] (CR) AOkoth: sql_exporter: specify column for metric (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[08:58:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:02] (CR) Brouberol: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: Btullis)
[09:04:10] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67418 and previous config saved to /var/cache/conftool/dbconfig/20240821-090421-root.json
[09:05:46] (CR) Vgutierrez: [C:+1] ncmonitor: Remove duplicate sysuser creation [puppet] - https://gerrit.wikimedia.org/r/1062449 (owner: 
BCornwall)
[09:06:23] (CR) Vgutierrez: "this can be abandoned now" [puppet] - https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: Fabfur)
[09:09:23] (CR) Vgutierrez: [C:+1] "looks good, this could be handy for the RSA deprecation and TLSv1.2 deprecation as well BTW" [puppet] - https://gerrit.wikimedia.org/r/1064125 (https://phabricator.wikimedia.org/T370200) (owner: BCornwall)
[09:11:30] sre-alert-triage, SRE Observability (FY2024/2025-Q1): Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#10080352 (fgiunchedi) Open→Resolved a:fgiunchedi Alert is gone as...
[09:13:49] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 28173
[09:14:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28173
[09:14:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 262725
[09:16:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 262725
[09:18:39] (CR) Btullis: [V:+1] Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: Btullis)
[09:19:23] (PS15) AOkoth: sql_exporter: specify column for metric [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822)
[09:22:01] SRE, CAS-SSO, Infrastructure-Foundations: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#10080378 (SLyngshede-WMF) Open→Resolved
[09:22:08] (CR) AOkoth: sql_exporter: specify column for metric (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth) 
[09:23:00] (PS1) Brouberol: airflow-test-k8s: configure the ingress to account for the external DNS domain [deployment-charts] - https://gerrit.wikimedia.org/r/1064334 (https://phabricator.wikimedia.org/T363001)
[09:23:35] (PS16) AOkoth: sql_exporter: specify column for metric [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822)
[09:24:49] (CR) Btullis: [C:+1] "Looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1064334 (https://phabricator.wikimedia.org/T363001) (owner: Brouberol)
[09:30:43] SRE, observability, SRE Observability (FY2024/2025-Q1), Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788#10080413 (fgiunchedi) Open→Resolved a:fgiunchedi I'm optimistically calling this...
[09:31:15] (CR) Brouberol: [C:+2] airflow-test-k8s: configure the ingress to account for the external DNS domain [deployment-charts] - https://gerrit.wikimedia.org/r/1064334 (https://phabricator.wikimedia.org/T363001) (owner: Brouberol)
[09:32:40] (CR) AOkoth: [C:+1] gitlab: use burst parameter with nftables_throttling [puppet] - https://gerrit.wikimedia.org/r/1064328 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:33:33] (PS1) Slyngshede: P:idp Remove CAS 6.6 test hosts. [puppet] - https://gerrit.wikimedia.org/r/1064335 (https://phabricator.wikimedia.org/T372997)
[09:34:01] (CR) Btullis: [V:+1] Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: Btullis)
[09:34:44] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[09:36:18] (PS17) AOkoth: sql_exporter: specify column for metric [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822)
[09:38:53] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/1063766/3709/" [puppet] - https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[09:40:06] SRE, Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942#10080452 (fgiunchedi) Open→Resolved a:fgiunchedi This was done in {T302832} (alert is a warning, and not spamming anymore)
[09:40:19] (CR) Jelto: [C:+2] gitlab: use burst parameter with nftables_throttling [puppet] - https://gerrit.wikimedia.org/r/1064328 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:41:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
[09:41:23] SRE, observability, serviceops: aggregate mismatched wikiversions alert - https://phabricator.wikimedia.org/T302832#10080436 (fgiunchedi) Open→Resolved a:fgiunchedi This is actually done as the warnings don't spam irc
[09:41:36] SRE, Observability-Alerting: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027#10080461 (fgiunchedi) Open→Resolved a:fgiunchedi I'm calling this done since we haven't experienced this problem again after https://gerrit.wikimedia.org/r/951592
[09:44:44] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[09:46:09] (03PS1) 10Brouberol: datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) [09:46:10] (03PS1) 10Brouberol: spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) [09:46:12] (03PS1) 10Brouberol: superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) [09:46:13] (03PS1) 10Brouberol: airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) [09:46:15] (03PS1) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) [09:49:02] (03CR) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [09:51:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[09:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:08] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064343 (https://phabricator.wikimedia.org/T365047) [09:56:08] (03CR) 10Fabfur: "yes, I have a lot of patch to abandon, I'll do a cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [09:56:26] (03Abandoned) 10Fabfur: hiera: test Benthos socket activation on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) (owner: 10Fabfur) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1000) [10:02:18] (03Abandoned) 10Fabfur: cache:benthos: switch to production topic names [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur) [10:04:47] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [10:11:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [10:12:18] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916#10080728 (10Clement_Goubert) Apparently the removal from the puppetserver wasn't properly done by the cookbook, I've done it manually and it should resolve. Sor... [10:17:01] (03CR) 10Kevin Bazira: "Thanks for the review Tobias. Please help merge as I don't have +2 rights. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1064021 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [10:21:09] (03PS1) 10Gmodena: data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) [10:26:24] (03PS1) 10Hnowlan: Enable shellbox-video for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064348 (https://phabricator.wikimedia.org/T356241) [10:26:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:33:26] (03CR) 10JMeybohm: [C:03+1] Enable shellbox-video for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064348 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:35:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064348 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:35:38] <3 https://schedule-deployment.toolforge.org/ so much [10:37:28] (03CR) 10Máté Szabó: [C:03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064343 (https://phabricator.wikimedia.org/T365047) (owner: 10STran) [10:39:15] (03PS2) 10Cathal Mooney: Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) [10:39:41] (03CR) 10Cathal Mooney: Fix puppet import so it doesn't fail if parent prefix has no role (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: 10Cathal Mooney) [10:39:43] (03PS1) 10Clément Goubert: sre.hosts: 
Add git paths on puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/1064346 [10:40:51] (03CR) 10STran: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064343 (https://phabricator.wikimedia.org/T365047) (owner: 10STran) [10:41:51] (03PS2) 10Kevin Bazira: httpbb: add post deployment tests for the logo-detection endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) [10:41:58] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064343 (https://phabricator.wikimedia.org/T365047) (owner: 10STran) [10:44:30] (03CR) 10Kevin Bazira: httpbb: add post deployment tests for the logo-detection endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [10:47:16] !log stran@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [10:48:31] (03CR) 10Klausman: [C:03+2] httpbb: add post deployment tests for the logo-detection endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [10:48:36] (03CR) 10Klausman: [C:03+2] httpbb: add post deployment tests for the rec-api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064021 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [10:48:39] !log stran@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:48:41] !log stran@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [10:49:09] (03PS3) 10Kevin Bazira: httpbb: add post deployment tests for the logo-detection endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) [10:49:49] (03CR) 10Klausman: httpbb: add post deployment tests for the logo-detection endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) 
(owner: 10Kevin Bazira) [10:50:49] (03CR) 10Klausman: [V:03+2 C:03+2] httpbb: add post deployment tests for the logo-detection endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [10:50:54] !log stran@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:52:36] !log stran@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:53:44] !log stran@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:54:07] (03CR) 10Ayounsi: [C:03+1] Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: 10Cathal Mooney) [10:54:26] !log stran@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [10:55:20] !log stran@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [10:55:38] 06SRE, 06Infrastructure-Foundations, 10netops: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015 (10cmooney) 03NEW p:05Triage→03Low [10:57:38] (03PS2) 10Clément Goubert: sre.hosts.move-vlan: Use puppetserver remote paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1064347 [10:59:16] (03PS1) 10Cathal Mooney: Do not route traffic via Reliance Jio directly in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1064353 (https://phabricator.wikimedia.org/T373015) [11:00:03] (03CR) 10Cathal Mooney: [C:03+2] Do not route traffic via Reliance Jio directly in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1064353 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney) [11:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1100). [11:00:22] (03PS1) 10Slyngshede: Enable Redis and TOTP support. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) [11:00:35] (03Merged) 10jenkins-bot: Do not route traffic via Reliance Jio directly in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1064353 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney) [11:01:50] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [11:02:38] (03PS2) 10Slyngshede: Enable Redis and TOTP support. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) [11:02:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:02:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:03:50] (03CR) 10Brouberol: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [11:07:53] (03PS1) 10Stevemunene: dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 (https://phabricator.wikimedia.org/T371833) [11:08:00] (03PS1) 10Chlod Alejandro: kawikisource: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064356 (https://phabricator.wikimedia.org/T368868) [11:10:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [11:10:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - 
https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [11:10:26] (03PS2) 10Slyngshede: D:apereo_cas::service Make exposed attributes configurable. [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [11:10:31] here [11:11:02] (03CR) 10Btullis: [V:03+1] Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [11:11:37] Hm [11:12:30] hnowlan: might be something we should talk about on -security? [11:13:04] Emperor: sgtm [11:13:08] (03PS1) 10Slyngshede: P:idp add airflow_test_k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1064357 (https://phabricator.wikimedia.org/T371209) [11:16:51] (03PS7) 10Btullis: Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) [11:18:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [11:21:34] (03Abandoned) 10Arnaudb: mariadb: temporary testing environment [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [11:23:34] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015#10080912 (10cmooney) Path is now avoided again and packet loss is no longer observable: ` cmooney@bast500... [11:30:06] (03CR) 10Stevemunene: [C:03+1] "looks good!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/1064357 (https://phabricator.wikimedia.org/T371209) (owner: 10Slyngshede) [11:30:30] (03CR) 10Slyngshede: [V:03+2 C:03+2] P:idp add airflow_test_k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1064357 (https://phabricator.wikimedia.org/T371209) (owner: 10Slyngshede) [11:31:51] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3713/co" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [11:36:39] Some countries (Russia, Belarus) now have no access to all Wikimedia sites. Timeout error. [11:36:50] (03Abandoned) 10Stevemunene: Add airflow-analytics-test secret [labs/private] - 10https://gerrit.wikimedia.org/r/1057820 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [11:53:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[11:56:46] (03PS1) 10Chlod Alejandro: kaawiktionary: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868) [12:00:18] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:00:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [12:06:37] (03Abandoned) 10Máté Szabó: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063227 (https://phabricator.wikimedia.org/T370502) (owner: 10Máté Szabó) [12:07:22] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:08:55] (03CR) 10Btullis: [V:03+1] Add TLS envoyproxy to the radosgw services on the DPE ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [12:14:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [12:22:40] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [12:22:47] !log install python3-pynetbox_7.4.0 manually on cumin2002 [12:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:39] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [12:28:03] (03CR) 10Filippo Giunchedi: [C:03+1] Netbox-hiera: add device role to mgmt_hosts (try 2) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064061 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [12:28:46] btullis: fyi, you might hit this bug during the re-image: https://phabricator.wikimedia.org/T371653 [12:29:39] I pushed a fix on cumin2002, will add the updated package to apt [12:30:07] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:32:59] (03PS1) 10Filippo Giunchedi: pontoon: clone netbox-hiera [puppet] - 10https://gerrit.wikimedia.org/r/1064367 [12:34:52] !log add python3-pynetbox_7.4.0_all.deb to reprepro - T371890 [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:55] T371890: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 [12:36:11] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: clone netbox-hiera [puppet] - 10https://gerrit.wikimedia.org/r/1064367 (owner: 10Filippo Giunchedi) [12:41:42] (03CR) 10AOkoth: [C:03+2] sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:42:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:42:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:42:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:42:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:42:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T370903)', diff saved to https://phabricator.wikimedia.org/P67420 and previous config saved to /var/cache/conftool/dbconfig/20240821-124252-ladsgroup.json [12:42:56] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:45:31] (03PS2) 10Brouberol: datahub: add digest to image tag, ensuring 
the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) [12:45:32] (03PS2) 10Brouberol: spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) [12:45:32] (03PS2) 10Brouberol: superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) [12:45:32] (03PS2) 10Brouberol: airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) [12:45:33] (03PS2) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) [12:45:34] (03PS1) 10Brouberol: cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) [12:45:38] (03PS1) 10Brouberol: cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 (https://phabricator.wikimedia.org/T373000) [12:47:36] (03PS1) 10AOkoth: vrts: add from clause to query [puppet] - 10https://gerrit.wikimedia.org/r/1064374 (https://phabricator.wikimedia.org/T310822) [12:48:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T370903)', diff saved to https://phabricator.wikimedia.org/P67421 and previous config saved to /var/cache/conftool/dbconfig/20240821-124828-ladsgroup.json [12:48:29] (03PS1) 10Brouberol: deployment_server: change the PG image tag to timestamp@digest [puppet] - 10https://gerrit.wikimedia.org/r/1064375 
(https://phabricator.wikimedia.org/T373000) [12:48:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:49:23] (03CR) 10Btullis: [C:03+1] deployment_server: change the PG image tag to timestamp@digest [puppet] - 10https://gerrit.wikimedia.org/r/1064375 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [12:50:20] (03CR) 10Brouberol: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [12:50:41] (03CR) 10Brouberol: [C:03+2] deployment_server: change the PG image tag to timestamp@digest [puppet] - 10https://gerrit.wikimedia.org/r/1064375 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [12:50:43] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1064374/3714/" [puppet] - 10https://gerrit.wikimedia.org/r/1064374 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:50:50] (03CR) 10AOkoth: [C:03+2] vrts: add from clause to query [puppet] - 10https://gerrit.wikimedia.org/r/1064374 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:52:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [12:53:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:56:34] (03PS4) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [12:57:27] (03PS2) 10AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) [12:57:34] (03PS3) 10AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) [12:59:57] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1005.eqiad.wmnet with OS bookworm [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1300). [13:00:05] hnowlan and lolekek: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:21] (03CR) 10Ayounsi: [C:03+1] P:idp Remove CAS 6.6 test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1064335 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [13:01:24] o/ [13:02:22] (03CR) 10Ayounsi: [C:03+2] Netbox-hiera: add device role to mgmt_hosts (try 2) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064061 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:03:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P67422 and previous config saved to /var/cache/conftool/dbconfig/20240821-130335-ladsgroup.json [13:06:40] (03PS5) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [13:06:42] (03CR) 10Btullis: [V:03+1 C:03+2] Add TLS envoyproxy to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [13:14:18] (03Merged) 10jenkins-bot: Netbox-hiera: add device role to mgmt_hosts (try 2) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064061 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:16:42] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "mgmt: add role - ayounsi@cumin1002" [13:17:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "mgmt: add role - ayounsi@cumin1002" [13:18:08] (03CR) 10Filippo Giunchedi: "This is a good start and not enough, you also have to initialize sth like $sql_exporter_jobs, add the file pattern matching e.g. 
sql_*.yam" [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [13:18:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P67423 and previous config saved to /var/cache/conftool/dbconfig/20240821-131842-ladsgroup.json [13:20:52] FIRING: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [13:24:10] (03PS1) 10Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) [13:24:24] (03PS2) 10Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) [13:24:45] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:25:34] nobody around who can deploy? [13:28:36] (03PS3) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 [13:29:12] hnowlan: I can deploy :) [13:30:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2035.codfw.wmnet with OS bookworm [13:30:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10081292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet... [13:30:41] lolekek: are you around as well? 
[13:30:45] yes [13:30:52] RESOLVED: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [13:30:55] cdanis: thank you! [13:31:04] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw with reason: test failover lvs2013 to ls2014 [13:31:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw with reason: test failover lvs2013 to ls2014 [13:31:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064348 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:31:32] (03CR) 10Ayounsi: [V:03+1] "Diff looks horrible, but seems correct. Taking any random server it removes it but re-adds it somewhere else. Taking any random switch, it" [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:31:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10081293 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f69725b2-8f49-49fe-8766-ce7bb9ffa253) set by... 
[13:32:01] (03PS2) 10Hnowlan: shellbox-video, admin_ng: bump resource limits and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060104 (https://phabricator.wikimedia.org/T356241) [13:32:10] (03Merged) 10jenkins-bot: Enable shellbox-video for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064348 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:32:20] (03CR) 10Filippo Giunchedi: [C:03+1] Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:32:37] cdanis: my change will only take effect when it hits prod jobrunners so I don't need to test on mwdebug fwiw [13:32:44] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki (T356241)]] [13:32:47] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:33:05] hnowlan: hah was just about to ask because I thought so [13:33:07] great [13:33:30] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: test failover lvs2013 to ls2014 [13:33:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: test failover lvs2013 to ls2014 [13:33:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T370903)', diff saved to https://phabricator.wikimedia.org/P67424 and previous config saved to /var/cache/conftool/dbconfig/20240821-133349-ladsgroup.json [13:33:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:33:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:33:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: 
move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10081302 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b60d0e72-74ce-4dee-9bed-2acca82f8655) set by... [13:34:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:34:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T370903)', diff saved to https://phabricator.wikimedia.org/P67425 and previous config saved to /var/cache/conftool/dbconfig/20240821-133411-ladsgroup.json [13:35:06] !log cdanis@deploy1003 hnowlan, cdanis: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:35:09] !log cdanis@deploy1003 hnowlan, cdanis: Continuing with sync [13:36:41] (03CR) 10Ayounsi: [V:03+1 C:03+2] Prometheus SSH probe: ignore network devices - try 2 [puppet] - 10https://gerrit.wikimedia.org/r/1064380 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:38:48] (03PS1) 10Brouberol: airflow: disable sqlalchemy pooling when deployed against a cloudnative cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064384 (https://phabricator.wikimedia.org/T372286) [13:39:04] !log disable PyBal on lvs2013 to switch traffic to lvs2014 [13:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T370903)', diff saved to https://phabricator.wikimedia.org/P67426 and previous config saved to /var/cache/conftool/dbconfig/20240821-133950-ladsgroup.json [13:39:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:40:02] !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki 
(T356241)]] (duration: 07m 18s) [13:40:06] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:40:12] (03CR) 10Btullis: [C:03+1] "Nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064384 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:40:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [13:40:51] (03PS7) 10أنون: [arwikinews]: Upgrade license to CC BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) [13:41:06] (03CR) 10TrainBranchBot: "Approved by cdanis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [13:41:19] (03PS1) 10Ayounsi: type Netbox::Device - make role mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1064385 (https://phabricator.wikimedia.org/T368513) [13:41:54] (03Merged) 10jenkins-bot: [arwikinews]: Upgrade license to CC BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [13:41:57] (03CR) 10Cathal Mooney: [C:03+1] type Netbox::Device - make role mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1064385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:42:12] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1063868|[arwikinews]: Upgrade license to CC BY-SA 4.0 (T372730)]] [13:42:16] T372730: Update content license for Arabic Wikinews to CC BY-SA 4.0 - https://phabricator.wikimedia.org/T372730 [13:42:48] (03CR) 10Brouberol: [C:03+2] airflow: disable sqlalchemy pooling when deployed against a cloudnative cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064384 (https://phabricator.wikimedia.org/T372286) (owner: 
10Brouberol) [13:43:30] Thanks cdanis! [13:43:40] lolekek: would you like to verify on the debug servers? [13:43:54] sure [13:44:30] !log cdanis@deploy1003 anwon, cdanis: Backport for [[gerrit:1063868|[arwikinews]: Upgrade license to CC BY-SA 4.0 (T372730)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:47:13] verified. [13:47:41] !log cdanis@deploy1003 anwon, cdanis: Continuing with sync [13:49:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:50:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:51:27] (03CR) 10Ayounsi: [C:03+1] sre.hosts: Add git paths on puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/1064346 (owner: 10Clément Goubert) [13:51:37] (03CR) 10Cathal Mooney: [C:03+2] Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: 10Cathal Mooney) [13:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:49] (03CR) 10Ayounsi: [C:03+1] sre.hosts.move-vlan: Use puppetserver remote paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1064347 (owner: 10Clément Goubert) [13:52:17] !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063868|[arwikinews]: Upgrade license to CC BY-SA 4.0 (T372730)]] (duration: 10m 05s) [13:52:21] T372730: Update content license for Arabic Wikinews to CC BY-SA 4.0 - https://phabricator.wikimedia.org/T372730 [13:52:29] all done [13:52:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply 
[13:53:30] (03Merged) 10jenkins-bot: Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: 10Cathal Mooney) [13:53:33] Thanks for your efforts [13:53:35] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10081384 (10jhathaway) sure will do [13:54:18] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:54:32] thanks cdanis! [13:55:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P67427 and previous config saved to /var/cache/conftool/dbconfig/20240821-135458-ladsgroup.json [13:55:17] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054866 (owner: 10PipelineBot) [13:55:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061980 (owner: 10PipelineBot) [13:55:25] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062068 (owner: 10PipelineBot) [13:55:29] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10081413 (10ayounsi) Confirmed that cr1-eqiad stopped generating those logs for 10.64.0.82 (prometheus10... 
[13:55:37] (03CR) 10Filippo Giunchedi: [C:03+1] type Netbox::Device - make role mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1064385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:55:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:55:51] (03CR) 10Ayounsi: [C:03+2] type Netbox::Device - make role mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1064385 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [13:58:00] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:58:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1400) [14:01:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T367856)', diff saved to https://phabricator.wikimedia.org/P67428 and previous config saved to /var/cache/conftool/dbconfig/20240821-140104-marostegui.json [14:01:12] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:03:18] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10081484 (10ayounsi) 05Open→03Resolved a:03ayounsi All done ! [14:09:55] (03PS5) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) [14:09:56] (03CR) 10Klausman: "I am unsure about the etiquette side: should I change the maintainer to myself as well?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [14:10:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P67429 and previous config saved to /var/cache/conftool/dbconfig/20240821-141006-ladsgroup.json [14:12:16] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10081522 (10Andrew) p:05Triage→03Medium [14:16:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P67430 and previous config saved to /var/cache/conftool/dbconfig/20240821-141611-marostegui.json [14:17:21] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10081536 (10Jelto) After reviewing the `DENYLIST` and the nftables logs, we noticed that so... [14:18:54] (03CR) 10JHathaway: [C:03+1] puppet8: add db_user [labs/private] - 10https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:19:02] (03CR) 10JHathaway: [C:03+2] puppet8: add db_user [labs/private] - 10https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:19:04] (03CR) 10JHathaway: [V:03+2 C:03+2] puppet8: add db_user [labs/private] - 10https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:22:16] (03CR) 10JHathaway: [C:03+1] P:idp Remove CAS 6.6 test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1064335 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [14:22:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2035.codfw.wmnet with OS bookworm [14:22:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10081572 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet wit... [14:22:52] !log enable PyBal on lvs2013 to swing traffic back from lvs2014 [14:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:07] (03CR) 10JHathaway: [C:03+1] Permission approval/rejection [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 (owner: 10Slyngshede) [14:25:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T370903)', diff saved to https://phabricator.wikimedia.org/P67431 and previous config saved to /var/cache/conftool/dbconfig/20240821-142514-ladsgroup.json [14:25:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:25:18] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:25:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:25:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T370903)', diff saved to https://phabricator.wikimedia.org/P67432 and previous config saved to /var/cache/conftool/dbconfig/20240821-142536-ladsgroup.json [14:27:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:27:58] (03PS1) 10Cathal Mooney: Allow the selection of any vlan in provision server script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) [14:28:34] (03CR) 10Clément Goubert: [C:03+2] sre.hosts: Add git paths on puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/1064346 (owner: 10Clément Goubert) [14:28:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T370903)', diff saved to https://phabricator.wikimedia.org/P67433 and previous config saved to /var/cache/conftool/dbconfig/20240821-142858-ladsgroup.json [14:30:00] (03PS1) 10Btullis: cephosd: remove a subset of LVM signatures during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1064388 (https://phabricator.wikimedia.org/T372783) [14:31:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P67434 and previous config saved to /var/cache/conftool/dbconfig/20240821-143118-marostegui.json [14:32:50] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: PuppetDB import failing for lvs2014 - https://phabricator.wikimedia.org/T372931#10081634 (10cmooney) 05Open→03Resolved a:03cmooney Working ok following update to the puppetdb import script. The issue was not actually caused by Ne... 
[14:36:41] (03PS2) 10Btullis: cephosd: remove a subset of LVM signatures during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1064388 (https://phabricator.wikimedia.org/T372783) [14:37:28] (03PS1) 10Hnowlan: Use shellbox-video for videoscaling on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) [14:37:54] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2024'] [14:38:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2024'] [14:38:37] (03CR) 10Clément Goubert: [C:03+1] Use shellbox-video for videoscaling on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:39:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:14] (03PS1) 10Hnowlan: use shellbox-video for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) [14:41:44] (03CR) 10Ahmon Dancy: [C:03+2] mw-debug/mw-web: Reduce CPU requests/limits for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064124 (owner: 10Ahmon Dancy) [14:41:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ganeti2036.codfw.wmnet [14:42:02] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10081715 (10Andrew) p:05Triage→03Medium [14:42:21] (03Merged) 10jenkins-bot: sre.hosts: Add git paths on puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/1064346 (owner: 10Clément Goubert) [14:42:38] (03CR) 10Clément Goubert: [C:03+2] 
sre.hosts.move-vlan: Use puppetserver remote paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1064347 (owner: 10Clément Goubert) [14:42:46] (03Merged) 10jenkins-bot: mw-debug/mw-web: Reduce CPU requests/limits for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064124 (owner: 10Ahmon Dancy) [14:43:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2024.mgmt.codfw.wmnet with reboot policy FORCED [14:44:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P67437 and previous config saved to /var/cache/conftool/dbconfig/20240821-144405-ladsgroup.json [14:44:07] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10081717 (10jhathaway) >>! In T369314#9997206, @KFrancis wrote: > The NDA is out for signatures. I'll confirm when it's complete. @KFrancis has the NDA been completed? [14:44:11] (03CR) 10TheDJ: varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [14:44:23] (03PS2) 10Dzahn: gerrit: create a temp insetup role to test java install in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1063904 (https://phabricator.wikimedia.org/T372804) [14:45:56] (03PS1) 10AikoChou: ml-services: bump memory for readability isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064391 (https://phabricator.wikimedia.org/T369712) [14:46:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T367856)', diff saved to https://phabricator.wikimedia.org/P67438 and previous config saved to /var/cache/conftool/dbconfig/20240821-144625-marostegui.json [14:46:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2163.codfw.wmnet with reason: Maintenance 
[14:46:31] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:46:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2163.codfw.wmnet with reason: Maintenance [14:46:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T367856)', diff saved to https://phabricator.wikimedia.org/P67439 and previous config saved to /var/cache/conftool/dbconfig/20240821-144648-marostegui.json [14:46:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2024.mgmt.codfw.wmnet with reboot policy FORCED [14:47:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10081718 (10cmooney) a:05cmooney→03None Work completed, no issues to report (although I had to downgrade the NIC firmw... [14:48:49] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542#10081748 (10Jhancock.wm) disk are all present and healthy. firmware is current on idrac and bios. (I see y'all did that recently) I didn't see any issues with the bios se... [14:53:24] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916#10081763 (10Jhancock.wm) [14:53:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10081764 (10jhathaway) [14:54:25] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916#10081757 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm np! things happen. looking out for each other. label has been changed! 
[14:55:27] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10081765 (10Clement_Goubert) [14:55:57] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916#10081766 (10Clement_Goubert) Thank you! [14:56:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10081767 (10jhathaway) [14:57:33] (03PS1) 10Scott French: kubernetes-wikikube: ignore shellbox-video unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1064392 (https://phabricator.wikimedia.org/T356241) [14:59:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P67440 and previous config saved to /var/cache/conftool/dbconfig/20240821-145912-ladsgroup.json [14:59:23] (03Merged) 10jenkins-bot: sre.hosts.move-vlan: Use puppetserver remote paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1064347 (owner: 10Clément Goubert) [14:59:34] (03CR) 10Andrea Denisse: [C:03+2] alert: Add the alert[12]002 hosts to acme chief [puppet] - 10https://gerrit.wikimedia.org/r/1064107 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:00:22] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542#10081780 (10bking) Sorry for not giving you that info earlier ;( . The reimage fails during the "installing" step...so when it's writing packages to the disk. That leads... 
[15:00:59] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10081785 (10CDanis) @Ladsgroup @Marosteg... [15:04:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:22] (03CR) 10Klausman: [C:03+1] ml-services: bump memory for readability isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064391 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [15:10:42] (03CR) 10FNegri: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1064388 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [15:12:47] (03CR) 10Clément Goubert: [C:03+1] use shellbox-video for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:13:09] (03CR) 10Hnowlan: [C:03+1] kubernetes-wikikube: ignore shellbox-video unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1064392 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [15:14:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T370903)', diff saved to https://phabricator.wikimedia.org/P67441 and previous config saved to /var/cache/conftool/dbconfig/20240821-151419-ladsgroup.json [15:14:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:14:23] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:14:35] !log ladsgroup@cumin1002 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:14:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T370903)', diff saved to https://phabricator.wikimedia.org/P67442 and previous config saved to /var/cache/conftool/dbconfig/20240821-151441-ladsgroup.json [15:15:50] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10081831 (10lmata) @Jhancock.wm thank you! [15:18:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T370903)', diff saved to https://phabricator.wikimedia.org/P67443 and previous config saved to /var/cache/conftool/dbconfig/20240821-151759-ladsgroup.json [15:22:37] (03PS1) 10Btullis: cephosd: Update logical volume labels to match mounts [puppet] - 10https://gerrit.wikimedia.org/r/1064399 (https://phabricator.wikimedia.org/T372783) [15:24:29] (03CR) 10Scott French: [C:03+2] kubernetes-wikikube: ignore shellbox-video unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1064392 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [15:25:39] (03Merged) 10jenkins-bot: kubernetes-wikikube: ignore shellbox-video unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1064392 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [15:27:35] (03CR) 10Btullis: [C:03+2] cephosd: remove a subset of LVM signatures during reimage [puppet] - 10https://gerrit.wikimedia.org/r/1064388 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [15:31:05] (03CR) 10Btullis: [C:03+2] cephosd: Update logical volume labels to match mounts [puppet] - 10https://gerrit.wikimedia.org/r/1064399 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [15:33:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P67444 and previous config saved to 
/var/cache/conftool/dbconfig/20240821-153306-ladsgroup.json [15:35:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10081945 (10Jhancock.wm) [15:35:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10081944 (10Clement_Goubert) [15:36:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [15:37:22] (03CR) 10AikoChou: [C:03+2] ml-services: bump memory for readability isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064391 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [15:38:21] (03Merged) 10jenkins-bot: ml-services: bump memory for readability isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064391 (https://phabricator.wikimedia.org/T369712) (owner: 10AikoChou) [15:48:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P67445 and previous config saved to /var/cache/conftool/dbconfig/20240821-154815-ladsgroup.json [15:48:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2035.codfw.wmnet with OS bookworm [15:49:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10082000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet... [15:50:02] (03CR) 10BCornwall: [C:03+1] dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [15:56:04] oooh! 
[15:56:06] thanks [15:56:40] * MichaelG_WMF remembers there used to be a shell utility, but maybe that was on toolforge or cloudvps? [15:56:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:56:58] (03PS1) 10JHathaway: analytics_privatedata_users: add ifeatunnaobiwmde [puppet] - 10https://gerrit.wikimedia.org/r/1064404 (https://phabricator.wikimedia.org/T371796) [15:57:15] !log T372333, with I431d2aba14db9ab8931e21260cb2005c7276e2b8 checked out, running mwscript /home/migr/GrowthExperiments/maintenance/fixLinkRecommendationData.php --dry-run --wiki=testwiki --search-index --db-table [15:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:16] T372333: de.wikipedia: Add Link unavailable due to a high-number of dangling records - https://phabricator.wikimedia.org/T372333 [15:58:25] MichaelG_WMF: I also had vague memories of this but then I realized I think that was just a feature for pastebin.com [15:58:45] there's phaste for phabricator paste [15:59:00] ah, yes! 
[16:00:14] we have a bash sal logger for helmfile on deploy servers, we could probably deploy something like that to other servers if there's interest though [16:03:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T370903)', diff saved to https://phabricator.wikimedia.org/P67446 and previous config saved to /var/cache/conftool/dbconfig/20240821-160323-ladsgroup.json [16:03:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:03:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:03:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:03:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67447 and previous config saved to /var/cache/conftool/dbconfig/20240821-160345-ladsgroup.json [16:04:32] toolforge has the dologmsg command [16:05:32] (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Remove duplicate sysuser creation [puppet] - 10https://gerrit.wikimedia.org/r/1062449 (owner: 10BCornwall) [16:05:51] phaste is so useful [16:06:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:07:14] (03PS3) 10Dzahn: gerrit: create a temp insetup role to test java install in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1063904 (https://phabricator.wikimedia.org/T372804) [16:08:18] AntiComposite: well apparently so does mwmaint [16:08:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67448 and previous config saved 
to /var/cache/conftool/dbconfig/20240821-160831-ladsgroup.json [16:08:39] (just checked) [16:08:49] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:15:17] Trying to remember if this adds !log or not [16:16:42] (03CR) 10Dzahn: [C:03+2] gerrit: create a temp insetup role to test java install in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1063904 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [16:17:04] MichaelG_WMF: ok. the magic is `dologmsg` on mwmaint. You have to give it the whole message for IRC including the `!log` part. There are other wrappers built into things like scap that add the user@host bits when they call the relay. [16:17:27] * bd808_ had kind of forgotten how this all works [16:17:54] Aha! Good to know for next time, thanks :) [16:19:57] (03PS15) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) [16:20:03] dologmsg goes way, way, way back too -- https://gerrit.wikimedia.org/r/c/operations/puppet/+/559 [16:21:41] (03PS1) 10Dduvall: mw-debug/mw-web: Reduce CPU requests/limits further for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 [16:23:30] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3715/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [16:23:36] (03CR) 10Ahmon Dancy: mw-debug/mw-web: Reduce CPU requests/limits further for train-dev (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 (owner: 10Dduvall) [16:23:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P67449 and previous config saved to
/var/cache/conftool/dbconfig/20240821-162339-ladsgroup.json [16:24:31] (03PS1) 10Dzahn: gerrit: add firewall, java, scap, mail settings to Hiera for gerrit1004 [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) [16:24:42] (03CR) 10CI reject: [V:04-1] gerrit: add firewall, java, scap, mail settings to Hiera for gerrit1004 [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [16:24:49] (03CR) 10BCornwall: Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [16:25:10] (03PS2) 10Dduvall: mw-debug/mw-web: Reduce CPU requests/limits further for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 [16:25:19] (03CR) 10Dduvall: mw-debug/mw-web: Reduce CPU requests/limits further for train-dev (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 (owner: 10Dduvall) [16:25:48] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 (owner: 10Dduvall) [16:26:03] (03CR) 10Dduvall: [C:03+2] mw-debug/mw-web: Reduce CPU requests/limits further for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 (owner: 10Dduvall) [16:26:27] (03PS1) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [16:27:06] (03Merged) 10jenkins-bot: mw-debug/mw-web: Reduce CPU requests/limits further for train-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064411 (owner: 10Dduvall) [16:27:18] (03PS2) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [16:28:52] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for 
release 'main' . [16:32:12] (03CR) 10Ssingh: [C:03+1] analytics_privatedata_users: add ifeatunnaobiwmde [puppet] - 10https://gerrit.wikimedia.org/r/1064404 (https://phabricator.wikimedia.org/T371796) (owner: 10JHathaway) [16:32:46] (03CR) 10JHathaway: [C:03+2] analytics_privatedata_users: add ifeatunnaobiwmde [puppet] - 10https://gerrit.wikimedia.org/r/1064404 (https://phabricator.wikimedia.org/T371796) (owner: 10JHathaway) [16:33:01] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10082140 (10jhathaway) 05Open→03Resolved added [16:33:08] 06SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for jhathaway - https://phabricator.wikimedia.org/T372663#10082142 (10jhathaway) 05Open→03Resolved a:03jhathaway [16:37:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10082154 (10jhathaway) 05Open→03Resolved merged [16:38:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P67450 and previous config saved to /var/cache/conftool/dbconfig/20240821-163846-ladsgroup.json [16:40:43] !log imported php-apcu_5.1.23-1+wmf11u1 into component/php81 - T372507 [16:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:53] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [16:41:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2035.codfw.wmnet with OS bookworm [16:41:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10082219 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet wit... 
[16:41:49] !log imported php-excimer_1.2.2-1+wmf11u1 into component/php81 - T372507 [16:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:45] !log imported php-imagick_3.7.0-6+wmf11u1 into component/php81 - T372507 [16:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:23] (03PS3) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [16:43:51] (03CR) 10Dzahn: "what it actually s" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [16:43:53] (03PS1) 10Ahmon Dancy: profile::kubernetes::deployment_server::mariadb_master_ips: Handle no match situation [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) [16:44:16] !log imported php-luasandbox_4.1.2-1+wmf11u1 into component/php81 - T372507 [16:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:26] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) (owner: 10Ahmon Dancy) [16:45:09] (03CR) 10Dzahn: "gotcha!" [puppet] - 10https://gerrit.wikimedia.org/r/1064065 (owner: 10Dzahn) [16:45:13] !log imported php-msgpack_2.2.0-4+wmf11u1 into component/php81 - T372507 [16:45:15] (03Abandoned) 10Dzahn: nftables_throttling: set a default value for burst parameter [puppet] - 10https://gerrit.wikimedia.org/r/1064065 (owner: 10Dzahn) [16:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:58] !log imported php-pcov_1.0.11-5+wmf11u1 into component/php81 - T372507 [16:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:07] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [16:46:20] (03CR) 10Dzahn: "yea, true. 
I will come up with a new name" [puppet] - 10https://gerrit.wikimedia.org/r/1064091 (owner: 10Dzahn) [16:46:21] (03CR) 10Dzahn: [C:03+2] acme_chief: remove outdated gerrit service names [puppet] - 10https://gerrit.wikimedia.org/r/1064091 (owner: 10Dzahn) [16:46:32] (03CR) 10CI reject: [V:04-1] profile::kubernetes::deployment_server::mariadb_master_ips: Handle no match situation [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) (owner: 10Ahmon Dancy) [16:46:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ganeti2036.codfw.wmnet [16:46:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ganeti2036.codfw.wmnet [16:47:48] (03PS2) 10Ahmon Dancy: profile::kubernetes::deployment_server::mariadb_master_ips: Handle no match [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) [16:47:50] (03PS1) 10MusikAnimal: Explicitly set font size in VisualEditor + CodeMirror 6 [skins/MinervaNeue] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064418 (https://phabricator.wikimedia.org/T357482) [16:48:04] (03Abandoned) 10MusikAnimal: Explicitly set font size in VisualEditor + CodeMirror 6 [skins/MinervaNeue] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064418 (https://phabricator.wikimedia.org/T357482) (owner: 10MusikAnimal) [16:48:10] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for GreyOlson - https://phabricator.wikimedia.org/T372934#10082257 (10jhathaway) [16:49:14] (03PS1) 10MusikAnimal: ve.ui.CodeMirrorAction.v6: use infinity viewport to avoid misalignment [extensions/CodeMirror] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064419 (https://phabricator.wikimedia.org/T357482) [16:49:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [16:50:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db1165.eqiad.wmnet with reason: Maintenance [16:50:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:50:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CodeMirror] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064419 (https://phabricator.wikimedia.org/T357482) (owner: 10MusikAnimal) [16:50:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:50:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T371742)', diff saved to https://phabricator.wikimedia.org/P67451 and previous config saved to /var/cache/conftool/dbconfig/20240821-165027-ladsgroup.json [16:50:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:51:23] (03CR) 10Ahmon Dancy: "PCC results are unchanged for deploy1003.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) (owner: 10Ahmon Dancy) [16:51:28] where has ScheduleDeploymentBot been all of my life! 
great work bd808 [16:51:47] (03Abandoned) 10Dzahn: idp:standalone: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1060901 (owner: 10Dzahn) [16:53:14] (03PS1) 10JHathaway: clinic-duty: Add Grey Olson to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064420 (https://phabricator.wikimedia.org/T372934) [16:53:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037658 (https://phabricator.wikimedia.org/T170001) (owner: 10MusikAnimal) [16:53:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T370903)', diff saved to https://phabricator.wikimedia.org/P67452 and previous config saved to /var/cache/conftool/dbconfig/20240821-165353-ladsgroup.json [16:53:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [16:53:57] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:54:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1213.eqiad.wmnet with reason: Maintenance [16:54:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67453 and previous config saved to /var/cache/conftool/dbconfig/20240821-165415-ladsgroup.json [16:58:18] musikanimal: thanks! It was just waiting for someone to write the code this whole time. 
;) [16:58:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67454 and previous config saved to /var/cache/conftool/dbconfig/20240821-165829-ladsgroup.json [16:59:47] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Remove unused browser security checks [puppet] - 10https://gerrit.wikimedia.org/r/1064125 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [17:00:55] (03PS1) 10JHathaway: clinic-duty: Add Halley Coplin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064421 (https://phabricator.wikimedia.org/T372907) [17:02:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T371742)', diff saved to https://phabricator.wikimedia.org/P67455 and previous config saved to /var/cache/conftool/dbconfig/20240821-170206-ladsgroup.json [17:02:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:03:59] (03CR) 10Ssingh: [C:03+1] clinic-duty: Add Grey Olson to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064420 (https://phabricator.wikimedia.org/T372934) (owner: 10JHathaway) [17:04:28] (03CR) 10BCornwall: [C:03+1] clinic-duty: Add Halley Coplin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064421 (https://phabricator.wikimedia.org/T372907) (owner: 10JHathaway) [17:06:03] (03CR) 10JHathaway: [C:03+2] clinic-duty: Add Halley Coplin to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064421 (https://phabricator.wikimedia.org/T372907) (owner: 10JHathaway) [17:07:10] (03CR) 10Dzahn: [C:03+1] "https://app.betterworks.com/app/#/profile/783293" [puppet] - 10https://gerrit.wikimedia.org/r/1064420 (https://phabricator.wikimedia.org/T372934) (owner: 10JHathaway) [17:08:15] (03PS2) 10JHathaway: clinic-duty: Add Grey Olson to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064420 
(https://phabricator.wikimedia.org/T372934) [17:11:34] (03CR) 10JHathaway: [C:03+2] clinic-duty: Add Grey Olson to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1064420 (https://phabricator.wikimedia.org/T372934) (owner: 10JHathaway) [17:13:09] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for HCoplin - https://phabricator.wikimedia.org/T372907#10082427 (10jhathaway) 05Open→03Resolved a:03jhathaway [17:13:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P67457 and previous config saved to /var/cache/conftool/dbconfig/20240821-171337-ladsgroup.json [17:16:21] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for GreyOlson - https://phabricator.wikimedia.org/T372934#10082429 (10jhathaway) 05Open→03Resolved a:03jhathaway [17:17:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P67458 and previous config saved to /var/cache/conftool/dbconfig/20240821-171714-ladsgroup.json [17:19:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1017.eqiad.wmnet with OS bookworm [17:19:35] (03PS1) 10Ladsgroup: Change the disabled query page for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064424 (https://phabricator.wikimedia.org/T369024) [17:19:37] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm executed with errors: - an... 
[17:20:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ganeti2036.codfw.wmnet [17:24:30] jouncebot: nowandnext [17:24:30] For the next 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1700) [17:24:30] In 0 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1800) [17:26:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064424 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [17:27:44] (03Merged) 10jenkins-bot: Change the disabled query page for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064424 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [17:28:05] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1064424|Change the disabled query page for commons (T369024)]] [17:28:08] T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024 [17:28:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P67459 and previous config saved to /var/cache/conftool/dbconfig/20240821-172844-ladsgroup.json [17:30:22] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1064424|Change the disabled query page for commons (T369024)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:31:04] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:31:05] (03PS1) 10Eevans: beta: deployment-restbase05 as deployment target [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1064427 (https://phabricator.wikimedia.org/T370460) [17:32:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff 
saved to https://phabricator.wikimedia.org/P67460 and previous config saved to /var/cache/conftool/dbconfig/20240821-173221-ladsgroup.json [17:32:23] (03CR) 10Eevans: [V:03+2 C:03+2] beta: deployment-restbase05 as deployment target [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1064427 (https://phabricator.wikimedia.org/T370460) (owner: 10Eevans) [17:34:02] !log imported php-wmerrors_2.0.0-1+wmf11u1 into component/php81 - T372507 [17:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:07] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [17:34:41] !log imported php-yaml_2.2.3-2+wmf11u1 into component/php81 - T372507 [17:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:50] (03PS2) 10Dzahn: gerrit: add firewall, java, scap, mail settings to Hiera for gerrit1004 [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) [17:35:42] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064424|Change the disabled query page for commons (T369024)]] (duration: 07m 36s) [17:35:45] T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024 [17:35:50] (03PS3) 10Dzahn: gerrit: add firewall, java, scap, mail settings to Hiera for gerrit1004 [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) [17:35:50] !log imported tideways_5.0.4-16+wmf11u1 into component/php81 - T372507 [17:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:27] !log imported wikidiff2_1.14.1-2+wmf11u1 into component/php81 - T372507 [17:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:01] !log imported xdebug_3.3.2-1+wmf11u1 into component/php81 - T372507 [17:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:59] !log 
btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1005.eqiad.wmnet with OS bookworm [17:40:11] (03CR) 10Andrea Denisse: [C:03+2] alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [17:43:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67461 and previous config saved to /var/cache/conftool/dbconfig/20240821-174351-ladsgroup.json [17:43:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:43:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:44:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:47:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T371742)', diff saved to https://phabricator.wikimedia.org/P67462 and previous config saved to /var/cache/conftool/dbconfig/20240821-174728-ladsgroup.json [17:47:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:47:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:47:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:47:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T371742)', diff saved to https://phabricator.wikimedia.org/P67463 and previous config saved to /var/cache/conftool/dbconfig/20240821-174750-ladsgroup.json [17:48:46] !log ladsgroup@cumin1002 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:48:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:53:19] are these alerts related to the transition? [17:53:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:53:28] to root@, PROBLEM Service Alert: localhost/Total Processes is WARNING [17:54:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bookworm [17:54:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with OS bookworm [17:54:58] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082539 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1016.eqiad.wmnet with OS bookworm [17:55:00] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082540 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm [17:55:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1018.eqiad.wmnet with OS bookworm [17:55:06] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - 
https://phabricator.wikimedia.org/T370543#10082541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1018.eqiad.wmnet with OS bookworm [17:55:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bookworm [17:55:09] (03PS1) 10Bking: WIP: Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) [17:55:10] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1019.eqiad.wmnet with OS bookworm [17:55:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bookworm [17:55:18] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1020.eqiad.wmnet with OS bookworm [17:56:03] !log imported php-igbinary_3.2.15-1+wmf11u1 into component/php81 - T372507 [17:56:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:06] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [17:56:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:56:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:56:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 
db2186.codfw.wmnet with reason: Maintenance [17:56:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T370903)', diff saved to https://phabricator.wikimedia.org/P67464 and previous config saved to /var/cache/conftool/dbconfig/20240821-175638-ladsgroup.json [17:56:41] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:57:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:58:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T371742)', diff saved to https://phabricator.wikimedia.org/P67465 and previous config saved to /var/cache/conftool/dbconfig/20240821-175843-ladsgroup.json [17:58:47] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:00:04] andre and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T1800) [18:00:09] jouncebot: ain't nothing to deploy tonite but thanks [18:00:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T370903)', diff saved to https://phabricator.wikimedia.org/P67466 and previous config saved to /var/cache/conftool/dbconfig/20240821-180049-ladsgroup.json [18:02:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:02:34] (03CR) 10Andrea Denisse: [C:03+2] alert: Add the alert[12]002 hosts to Prometheus blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/1064097 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [18:04:24] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [18:08:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1016.eqiad.wmnet with reason: host reimage [18:09:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1018.eqiad.wmnet with reason: host reimage [18:09:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1019.eqiad.wmnet with reason: host reimage [18:10:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1020.eqiad.wmnet with reason: host reimage [18:12:22] (03PS1) 10Dzahn: releases: upgrade Java JDK version from 11 to 17 [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) [18:12:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1016.eqiad.wmnet with reason: host reimage [18:13:42] !log imported php-redis_6.0.2-1+wmf11u1 into component/php81 - T372507 [18:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:13:47] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [18:13:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67467 and previous config saved to /var/cache/conftool/dbconfig/20240821-181351-ladsgroup.json [18:15:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1020.eqiad.wmnet with reason: host reimage [18:15:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P67468 and previous config saved to /var/cache/conftool/dbconfig/20240821-181556-ladsgroup.json [18:16:24] (03CR) 10Dzahn: [V:03+1] "confirmed in compiler it actually installs the new package: https://puppet-compiler.wmflabs.org/output/1064437/3717/releases1003.eqiad.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:17:11] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10082614 (10CDanis) p:05Triage→03High [18:17:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1018.eqiad.wmnet with reason: host reimage [18:21:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1019.eqiad.wmnet with reason: host reimage [18:21:13] (03PS4) 10Bking: WIP: Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) [18:21:46] (03CR) 10CI reject: [V:04-1] WIP: Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) (owner: 
10Bking) [18:21:50] !log imported php-memcached_3.2.0++-1+wmf11u1 into component/php81 - T372507 [18:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:54] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [18:26:42] (03PS5) 10Bking: Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) [18:27:52] (03CR) 10CI reject: [V:04-1] Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [18:28:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67469 and previous config saved to /var/cache/conftool/dbconfig/20240821-182858-ladsgroup.json [18:31:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P67470 and previous config saved to /var/cache/conftool/dbconfig/20240821-183104-ladsgroup.json [18:33:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:34:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:34:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1016.eqiad.wmnet with OS bookworm [18:34:13] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1016.eqiad.wmnet with OS bookworm completed: - an-presto1016... 
[18:36:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:36:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:36:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1020.eqiad.wmnet with OS bookworm [18:37:00] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1020.eqiad.wmnet with OS bookworm completed: - an-presto1020... [18:38:44] (03CR) 10AOkoth: "Yeah, it's a really long file. 😮" [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [18:39:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:40:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:40:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1018.eqiad.wmnet with OS bookworm [18:40:14] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1018.eqiad.wmnet with OS bookworm completed: - an-presto1018... 
[18:41:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1017.eqiad.wmnet with OS bookworm [18:42:06] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm executed with errors: - an... [18:43:03] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:43:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:43:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1019.eqiad.wmnet with OS bookworm [18:43:24] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1019.eqiad.wmnet with OS bookworm completed: - an-presto1019... 
[18:44:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T371742)', diff saved to https://phabricator.wikimedia.org/P67471 and previous config saved to /var/cache/conftool/dbconfig/20240821-184405-ladsgroup.json [18:44:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [18:44:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:44:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [18:44:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T371742)', diff saved to https://phabricator.wikimedia.org/P67472 and previous config saved to /var/cache/conftool/dbconfig/20240821-184427-ladsgroup.json [18:46:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T370903)', diff saved to https://phabricator.wikimedia.org/P67473 and previous config saved to /var/cache/conftool/dbconfig/20240821-184611-ladsgroup.json [18:46:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:46:20] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:46:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:46:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T370903)', diff saved to https://phabricator.wikimedia.org/P67474 and previous config saved to /var/cache/conftool/dbconfig/20240821-184633-ladsgroup.json [18:46:41] (03PS4) 10AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) 
[18:50:03] (PS5) AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822)
[18:51:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with OS bookworm
[18:51:44] ops-eqiad, SRE, Data-Platform, DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm
[18:53:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T370903)', diff saved to https://phabricator.wikimedia.org/P67475 and previous config saved to /var/cache/conftool/dbconfig/20240821-185300-ladsgroup.json
[18:53:04] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:54:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T371742)', diff saved to https://phabricator.wikimedia.org/P67476 and previous config saved to /var/cache/conftool/dbconfig/20240821-185452-ladsgroup.json
[18:54:56] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[18:55:04] (CR) AOkoth: "I think it's good now." [puppet] - https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[19:06:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1017.eqiad.wmnet with reason: host reimage
[19:08:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P67477 and previous config saved to /var/cache/conftool/dbconfig/20240821-190807-ladsgroup.json
[19:08:59] gerrit down for me or everyone else
[19:09:05] sukhe: just came here to ask the same
[19:09:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1017.eqiad.wmnet with reason: host reimage
[19:09:50] down for me as well.
[19:09:56] ok
[19:09:59] looking
[19:10:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67478 and previous config saved to /var/cache/conftool/dbconfig/20240821-190959-ladsgroup.json
[19:10:58] mutante, cdanis, sdeckelmann: Gerrit is down.
[19:11:19] hi!
[19:12:28] nvm seems to be back up. sorry for the ping!
[19:13:02] back up indeed but what happened here I wonder
[19:14:31] https://grafana.wikimedia.org/goto/GwtzqjjSR?orgId=1
[19:14:34] probably scraping
[19:15:27] usually you'd see an increase in the threadpool stats as well
[19:15:55] soo.. first of all.. I did look and it's back for me now
[19:16:12] I see a bunch of IPs that have been throttled
[19:16:39] (CR) Xcollazo: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1064108 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[19:16:57] (CR) JHathaway: [C:+2] puppet8: add explicit typecast [puppet] - https://gerrit.wikimedia.org/r/1064108 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[19:20:33] (CR) Ssingh: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) (owner: Cathal Mooney)
[19:21:03] mutante: they went away by the time I looked, unless `nft list set inet filter 'DENYLIST'` is the wrong thing to be doing
[19:21:48] cdanis: see -sec
[19:21:53] cdanis: the command is right, they were there but only for 5 minutes
[19:21:54] yeah just saw ty
[19:23:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P67479 and previous config saved to /var/cache/conftool/dbconfig/20240821-192314-ladsgroup.json
[19:24:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:25:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67480 and previous config saved to /var/cache/conftool/dbconfig/20240821-192507-ladsgroup.json
[19:30:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:30:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1017.eqiad.wmnet with OS bookworm
[19:31:05] ops-eqiad, SRE, Data-Platform, DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082759 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm completed: - an-presto1017...
[19:33:30] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082769 (10Jclark-ctr) [19:33:37] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10082770 (10Jclark-ctr) 05Open→03Resolved [19:36:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [19:38:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T370903)', diff saved to https://phabricator.wikimedia.org/P67481 and previous config saved to /var/cache/conftool/dbconfig/20240821-193821-ladsgroup.json [19:38:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:38:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:38:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:38:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T370903)', diff saved to https://phabricator.wikimedia.org/P67482 and previous config saved to /var/cache/conftool/dbconfig/20240821-193843-ladsgroup.json [19:40:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T371742)', diff saved to https://phabricator.wikimedia.org/P67483 and previous config saved to /var/cache/conftool/dbconfig/20240821-194014-ladsgroup.json [19:40:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:40:18] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:40:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:40:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67484 and previous config saved to /var/cache/conftool/dbconfig/20240821-194036-ladsgroup.json [19:44:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T370903)', diff saved to https://phabricator.wikimedia.org/P67485 and previous config saved to /var/cache/conftool/dbconfig/20240821-194445-ladsgroup.json [19:44:49] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:52:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67486 and previous config saved to /var/cache/conftool/dbconfig/20240821-195209-ladsgroup.json [19:52:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:56:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [19:57:59] (03PS1) 10Dreamy Jazz: Remove wgCheckUserPurgeOldClientHintsData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) [19:59:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P67487 and previous config saved to /var/cache/conftool/dbconfig/20240821-195952-ladsgroup.json [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T2000). 
[20:00:04] musikanimal: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:31] o/
[20:01:40] hi - i can deploy
[20:03:07] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CodeMirror] (wmf/1.43.0-wmf.19) - https://gerrit.wikimedia.org/r/1064419 (https://phabricator.wikimedia.org/T357482) (owner: MusikAnimal)
[20:07:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:07:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67488 and previous config saved to /var/cache/conftool/dbconfig/20240821-200716-ladsgroup.json
[20:11:51] !log bearloga@deploy1003 Started deploy [airflow-dags/analytics_product@1856d12]: (no justification provided)
[20:12:09] !log bearloga@deploy1003 Finished deploy [airflow-dags/analytics_product@1856d12]: (no justification provided) (duration: 00m 17s)
[20:14:46] !log bearloga@deploy1003 Started deploy [airflow-dags/analytics_product@1856d12]: (no justification provided)
[20:14:50] !log bearloga@deploy1003 Finished deploy [airflow-dags/analytics_product@1856d12]: (no justification provided) (duration: 00m 03s)
[20:15:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P67489 and previous config saved to /var/cache/conftool/dbconfig/20240821-201500-ladsgroup.json
[20:16:10] (Merged) jenkins-bot: ve.ui.CodeMirrorAction.v6: use infinity viewport to avoid misalignment [extensions/CodeMirror] (wmf/1.43.0-wmf.19) - https://gerrit.wikimedia.org/r/1064419 (https://phabricator.wikimedia.org/T357482) (owner: MusikAnimal)
[20:16:30] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1064419|ve.ui.CodeMirrorAction.v6: use infinity viewport to avoid misalignment (T357482)]]
[20:16:34] T357482: 2017 wikitext editor integration in CodeMirror 6 - https://phabricator.wikimedia.org/T357482
[20:21:11] !log cjming@deploy1003 musikanimal, cjming: Backport for [[gerrit:1064419|ve.ui.CodeMirrorAction.v6: use infinity viewport to avoid misalignment (T357482)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:21:17] hi musikanimal: 1st patch up on test servers if you'd like to verify - lmk if/when i should sync
[20:21:24] checking now!
[20:21:32] is that all mwdebug servers?
[20:21:39] yes i believe so
[20:22:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67492 and previous config saved to /var/cache/conftool/dbconfig/20240821-202224-ladsgroup.json
[20:23:54] okay, confirmed patch is merged. Feel free to sync cjming
[20:24:35] err, confirmed that the patch does what it's supposed to, I mean
[20:25:14] great - syncing
[20:25:17] !log cjming@deploy1003 musikanimal, cjming: Continuing with sync
[20:29:44] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064419|ve.ui.CodeMirrorAction.v6: use infinity viewport to avoid misalignment (T357482)]] (duration: 13m 14s)
[20:29:48] T357482: 2017 wikitext editor integration in CodeMirror 6 - https://phabricator.wikimedia.org/T357482
[20:30:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T370903)', diff saved to https://phabricator.wikimedia.org/P67494 and previous config saved to /var/cache/conftool/dbconfig/20240821-203007-ladsgroup.json
[20:30:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[20:30:13] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1037658 (https://phabricator.wikimedia.org/T170001) (owner: MusikAnimal)
[20:30:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:30:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[20:30:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T370903)', diff saved to https://phabricator.wikimedia.org/P67495 and previous config saved to /var/cache/conftool/dbconfig/20240821-203029-ladsgroup.json
[20:31:07] (Merged) jenkins-bot: [beta] Enable CodeMirrorRTL on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1037658 (https://phabricator.wikimedia.org/T170001) (owner: MusikAnimal)
[20:31:43] musikanimal: 1st patch should be live! 2nd is labs only so that's done too
[20:31:57] awesome! thank you cjming !
[20:32:02] yw! [20:34:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T370903)', diff saved to https://phabricator.wikimedia.org/P67496 and previous config saved to /var/cache/conftool/dbconfig/20240821-203442-ladsgroup.json [20:35:07] !log end of UTC late backport window [20:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:39] (03PS1) 10Jforrester: wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064466 (https://phabricator.wikimedia.org/T371837) [20:35:41] (03PS1) 10Jforrester: wikifunctions: Set some better default memory/CPU levels for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064467 (https://phabricator.wikimedia.org/T348681) [20:35:45] (03PS1) 10Jforrester: wikifunctions: Upgrade prod evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064468 (https://phabricator.wikimedia.org/T371837) [20:37:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67497 and previous config saved to /var/cache/conftool/dbconfig/20240821-203731-ladsgroup.json [20:37:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [20:37:35] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:37:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [20:37:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T371742)', diff saved to https://phabricator.wikimedia.org/P67498 and previous config saved to 
/var/cache/conftool/dbconfig/20240821-203753-ladsgroup.json [20:38:01] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064466 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [20:39:05] (03Merged) 10jenkins-bot: wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064466 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [20:40:13] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:40:19] (03PS8) 10Bking: Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) [20:41:00] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:48:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T371742)', diff saved to https://phabricator.wikimedia.org/P67499 and previous config saved to /var/cache/conftool/dbconfig/20240821-204802-ladsgroup.json [20:48:05] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:49:18] (03CR) 10Jforrester: [C:03+2] wikifunctions: Set some better default memory/CPU levels for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064467 (https://phabricator.wikimedia.org/T348681) (owner: 10Jforrester) [20:49:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to 
https://phabricator.wikimedia.org/P67500 and previous config saved to /var/cache/conftool/dbconfig/20240821-204948-ladsgroup.json [20:50:50] (03Merged) 10jenkins-bot: wikifunctions: Set some better default memory/CPU levels for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064467 (https://phabricator.wikimedia.org/T348681) (owner: 10Jforrester) [20:51:16] (03CR) 10Ryan Kemper: [C:03+1] Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [20:51:40] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:52:26] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:52:30] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:53:16] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:53:45] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:54:40] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:55:54] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade prod evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064468 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [20:56:58] (03Merged) 10jenkins-bot: wikifunctions: Upgrade prod evaluators from 2024-08-16-153209 to 2024-08-20-132618 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064468 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [20:57:17] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:57:19] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:57:26] !log jforrester@deploy1003 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [20:58:27] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:58:58] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:00:04] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240821T2100) [21:03:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67501 and previous config saved to /var/cache/conftool/dbconfig/20240821-210309-ladsgroup.json [21:04:22] (03CR) 10Bking: [C:03+2] Add load alerts for stat hosts [alerts] - 10https://gerrit.wikimedia.org/r/1064436 (https://phabricator.wikimedia.org/T373046) (owner: 10Bking) [21:04:36] (03CR) 10Dzahn: [C:03+2] gerrit: add firewall, java, scap, mail settings to Hiera for gerrit1004 [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:04:55] (03CR) 10Dzahn: [C:03+2] "not affecting the production instance" [puppet] - 10https://gerrit.wikimedia.org/r/1064412 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:04:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P67502 and previous config saved to /var/cache/conftool/dbconfig/20240821-210455-ladsgroup.json [21:09:59] !log amastilovic@deploy1003 Started deploy [airflow-dags/analytics@1856d12]: (no justification provided) [21:11:35] !log amastilovic@deploy1003 Finished deploy [airflow-dags/analytics@1856d12]: (no justification provided) (duration: 01m 35s) [21:13:20] (03PS1) 10Ryan Kemper: Revert "wdqs graph-split: temp remove main/scholarly pools" [puppet] - 10https://gerrit.wikimedia.org/r/1064473 [21:14:57] (03CR) 10Ryan Kemper: [C:03+2] wdqs: 
store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:16:30] (03CR) 10Ryan Kemper: [C:03+2] wdqs: store metadata about graph split type (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:18:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67503 and previous config saved to /var/cache/conftool/dbconfig/20240821-211816-ladsgroup.json [21:20:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T370903)', diff saved to https://phabricator.wikimedia.org/P67504 and previous config saved to /var/cache/conftool/dbconfig/20240821-212002-ladsgroup.json [21:20:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:20:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:20:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:20:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T370903)', diff saved to https://phabricator.wikimedia.org/P67505 and previous config saved to /var/cache/conftool/dbconfig/20240821-212024-ladsgroup.json [21:23:02] (03PS3) 10Ryan Kemper: wdqs: store graph type in data_loaded file [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [21:24:24] (03CR) 10Bking: [C:03+1] wdqs: store graph type in data_loaded file [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [21:24:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2192 (T370903)', diff saved to https://phabricator.wikimedia.org/P67506 and previous config saved to /var/cache/conftool/dbconfig/20240821-212425-ladsgroup.json [21:25:19] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling neither afterwards [21:25:23] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [21:26:23] (03Merged) 10jenkins-bot: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:33:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T371742)', diff saved to https://phabricator.wikimedia.org/P67507 and previous config saved to /var/cache/conftool/dbconfig/20240821-213323-ladsgroup.json [21:33:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [21:33:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:33:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [21:35:54] (03PS2) 10Ryan Kemper: wdqs: add graph split hosts to conftool_data [puppet] - 10https://gerrit.wikimedia.org/r/1064473 (https://phabricator.wikimedia.org/T364368) [21:38:59] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling neither afterwards [21:39:04] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [21:39:33] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P67508 and previous config saved to /var/cache/conftool/dbconfig/20240821-213932-ladsgroup.json [21:41:37] !log amastilovic@deploy1003 Started deploy [airflow-dags/research@109c99e]: (no justification provided) [21:41:41] !log amastilovic@deploy1003 Finished deploy [airflow-dags/research@109c99e]: (no justification provided) (duration: 00m 03s) [21:41:59] !log amastilovic@deploy1003 Started deploy [airflow-dags/analytics_product@1856d12]: (no justification provided) [21:42:03] !log amastilovic@deploy1003 Finished deploy [airflow-dags/analytics_product@1856d12]: (no justification provided) (duration: 00m 03s) [21:42:12] !log amastilovic@deploy1003 Started deploy [airflow-dags/search@109c99e]: (no justification provided) [21:42:16] !log amastilovic@deploy1003 Finished deploy [airflow-dags/search@109c99e]: (no justification provided) (duration: 00m 03s) [21:42:22] !log amastilovic@deploy1003 Started deploy [airflow-dags/wmde@109c99e]: (no justification provided) [21:42:25] !log amastilovic@deploy1003 Finished deploy [airflow-dags/wmde@109c99e]: (no justification provided) (duration: 00m 03s) [21:44:26] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:28] (03CR) 10Bking: [C:03+1] wdqs: add graph split hosts to conftool_data [puppet] - 10https://gerrit.wikimedia.org/r/1064473 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [21:45:45] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add graph split hosts to conftool_data [puppet] - 10https://gerrit.wikimedia.org/r/1064473 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [21:49:01] (03PS1) 10Ryan Kemper: wdqs: put in alphabetical order [puppet] - 
10https://gerrit.wikimedia.org/r/1064478 [21:49:27] (03CR) 10Bking: [C:03+1] "be the change you want to see in the world!" [puppet] - 10https://gerrit.wikimedia.org/r/1064478 (owner: 10Ryan Kemper) [21:49:34] (03CR) 10Ryan Kemper: [C:03+2] wdqs: put in alphabetical order [puppet] - 10https://gerrit.wikimedia.org/r/1064478 (owner: 10Ryan Kemper) [21:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:55] (03PS1) 10Ryan Kemper: wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) [21:54:17] (03CR) 10Bking: [C:03+1] wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [21:54:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P67509 and previous config saved to /var/cache/conftool/dbconfig/20240821-215440-ladsgroup.json [21:55:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [21:55:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [21:55:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T371742)', diff saved to https://phabricator.wikimedia.org/P67510 and previous config saved to /var/cache/conftool/dbconfig/20240821-215537-ladsgroup.json [21:55:41] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:59:30] (03PS1) 10Andrew Bogott: openstack keystone: add a new auth plugin to validate totp tokens against idm 
[puppet] - 10https://gerrit.wikimedia.org/r/1064480 (https://phabricator.wikimedia.org/T359551) [21:59:32] (03PS1) 10Andrew Bogott: openstack keystone: switch to idmtotp for 2fa [puppet] - 10https://gerrit.wikimedia.org/r/1064481 (https://phabricator.wikimedia.org/T359551) [22:02:35] (03PS19) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [22:02:35] (03CR) 10CDobbins: "This is still WIP, but I wanted to solicit feedback on this version thus far. All functionality has been added, with the exception of writ" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [22:09:08] (03CR) 10Andrew Bogott: [C:04-1] "This is entirely untested!" [puppet] - 10https://gerrit.wikimedia.org/r/1064480 (https://phabricator.wikimedia.org/T359551) (owner: 10Andrew Bogott) [22:09:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling neither afterwards [22:09:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T371742)', diff saved to https://phabricator.wikimedia.org/P67511 and previous config saved to /var/cache/conftool/dbconfig/20240821-220915-ladsgroup.json [22:09:16] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [22:09:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:09:38] (03CR) 10Andrew Bogott: [C:04-2] "We won't be ready to merge this for a while" [puppet] - 10https://gerrit.wikimedia.org/r/1064481 (https://phabricator.wikimedia.org/T359551) (owner: 10Andrew Bogott) [22:09:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2192 (T370903)', diff saved to https://phabricator.wikimedia.org/P67512 and previous config saved to /var/cache/conftool/dbconfig/20240821-220947-ladsgroup.json [22:09:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance [22:09:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:10:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2201.codfw.wmnet with reason: Maintenance [22:14:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance [22:14:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2211.codfw.wmnet with reason: Maintenance [22:14:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T370903)', diff saved to https://phabricator.wikimedia.org/P67514 and previous config saved to /var/cache/conftool/dbconfig/20240821-221450-ladsgroup.json [22:14:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:20:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T370903)', diff saved to https://phabricator.wikimedia.org/P67515 and previous config saved to /var/cache/conftool/dbconfig/20240821-222028-ladsgroup.json [22:20:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:22:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:24:23] !log ladsgroup@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67516 and previous config saved to /var/cache/conftool/dbconfig/20240821-222422-ladsgroup.json [22:30:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling neither afterwards [22:30:14] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [22:30:46] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P67517 and previous config saved to /var/cache/conftool/dbconfig/20240821-223535-ladsgroup.json [22:35:46] RESOLVED: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:39:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67518 and previous config saved to /var/cache/conftool/dbconfig/20240821-223929-ladsgroup.json [22:50:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2211', diff saved to https://phabricator.wikimedia.org/P67519 and previous config saved to /var/cache/conftool/dbconfig/20240821-225042-ladsgroup.json [22:52:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2035.codfw.wmnet with OS bookworm [22:52:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083277 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet... [22:54:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T371742)', diff saved to https://phabricator.wikimedia.org/P67520 and previous config saved to /var/cache/conftool/dbconfig/20240821-225436-ladsgroup.json [22:54:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [22:54:40] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:54:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [23:02:13] (03CR) 10Bking: [C:03+1] datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:05:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T370903)', diff saved to https://phabricator.wikimedia.org/P67521 and previous config saved to /var/cache/conftool/dbconfig/20240821-230549-ladsgroup.json [23:05:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2213.codfw.wmnet with reason: Maintenance [23:05:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - 
https://phabricator.wikimedia.org/T370903 [23:05:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2213.codfw.wmnet with reason: Maintenance [23:06:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67522 and previous config saved to /var/cache/conftool/dbconfig/20240821-230600-ladsgroup.json [23:09:57] (03CR) 10Bking: [C:03+1] spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:10:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67523 and previous config saved to /var/cache/conftool/dbconfig/20240821-231037-ladsgroup.json [23:13:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [23:13:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [23:14:09] (03CR) 10Bking: [C:03+1] superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:18:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2035.codfw.wmnet with reason: host reimage [23:20:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2036.codfw.wmnet with OS bookworm [23:20:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2037.codfw.wmnet with OS bookworm [23:20:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2038.codfw.wmnet with OS bookworm [23:20:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 
13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2036.codfw.wmnet... [23:20:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2037.codfw.wmnet... [23:20:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2038.codfw.wmnet... [23:22:07] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:22:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2035.codfw.wmnet with reason: host reimage [23:23:00] (03CR) 10Bking: [C:03+1] airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:23:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [23:23:22] (03CR) 10Bking: [C:03+1] growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:23:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [23:23:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Depooling db2124 (T371742)', diff saved to https://phabricator.wikimedia.org/P67524 and previous config saved to /var/cache/conftool/dbconfig/20240821-232341-ladsgroup.json [23:23:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2039.codfw.wmnet with OS bookworm [23:23:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:23:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2039.codfw.wmnet... [23:24:30] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:25:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P67525 and previous config saved to /var/cache/conftool/dbconfig/20240821-232544-ladsgroup.json [23:27:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:27:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:27:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:35:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2037.codfw.wmnet with reason: host reimage [23:37:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:37:39] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:37:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2035.codfw.wmnet with OS bookworm [23:37:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet wit... [23:38:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2036.codfw.wmnet with reason: host reimage [23:38:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2038.codfw.wmnet with reason: host reimage [23:38:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2037.codfw.wmnet with reason: host reimage [23:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1064495 [23:38:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1064495 (owner: 10TrainBranchBot) [23:40:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2039.codfw.wmnet with reason: host reimage [23:40:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P67526 and previous config saved to /var/cache/conftool/dbconfig/20240821-234051-ladsgroup.json [23:41:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2036.codfw.wmnet with reason: host reimage [23:42:24] (03CR) 10Bking: [C:03+1] cloudnative-pg-cluster: add digest to image tag, ensuring the 
image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:42:37] (03CR) 10Bking: [C:03+1] cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [23:44:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2039.codfw.wmnet with reason: host reimage [23:47:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2038.codfw.wmnet with reason: host reimage [23:48:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T371742)', diff saved to https://phabricator.wikimedia.org/P67527 and previous config saved to /var/cache/conftool/dbconfig/20240821-234808-ladsgroup.json [23:48:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:53:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:53:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:53:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2037.codfw.wmnet with OS bookworm [23:54:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083308 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2037.codfw.wmnet wit... 
[23:54:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10083309 (10VRiley-WMF) Calling back into dell for this ticket. It was supposed to have 1 day shipping, however has not yet arrived. [23:55:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [23:55:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T370903)', diff saved to https://phabricator.wikimedia.org/P67528 and previous config saved to /var/cache/conftool/dbconfig/20240821-235559-ladsgroup.json [23:56:02] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:56:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bullseye [23:56:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083332 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bull... 
[23:56:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:58:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:58:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2036.codfw.wmnet with OS bookworm [23:59:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2036.codfw.wmnet wit... [23:59:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"