[00:07:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1066131 (owner: TrainBranchBot)
[02:17:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:27] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:39:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:59:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:16:00] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2229 [puppet] - https://gerrit.wikimedia.org/r/1066431
[05:19:05] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2229 [puppet] - https://gerrit.wikimedia.org/r/1066431 (owner: Marostegui)
[05:29:08] <wikibugs>	 SRE-swift-storage, MW-on-K8s, serviceops, Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10090462 (Joe) For the record, the reason we wanted to support large file uploads was not to worsen the performance of upload-by-url, which has since been fixed by makin...
[05:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:31] <wikibugs>	 (CR) Jgiannelos: [C:+1] Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: Eevans)
[06:08:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 32934
[06:11:50] <wikibugs>	 (PS1) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451
[06:11:54] <wikibugs>	 (CR) Arnaudb: "this is to avoid noisy alerting while backups are performed." [alerts] - https://gerrit.wikimedia.org/r/1066451 (owner: Arnaudb)
[06:16:14] <wikibugs>	 (CR) Slyngshede: "Sorry, I got distracted. Merged patches are automatically deployed to idm-test.wikimedia.org, for testing, and only later released as part" [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:16:24] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934
[06:19:29] <wikibugs>	 (PS2) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991)
[06:20:57] <wikibugs>	 SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10090530 (ABran-WMF) p:High→Med...
[06:21:02] <wikibugs>	 SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10090532 (ABran-WMF) p:Medium→H...
[06:29:28] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:59:37] <wikibugs>	 (PS3) Wangombe: Update reference to ElasticSearchTtmServer [mediawiki-config] - https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342)
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T0700). nyaa~
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:49] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: add new revertrisk isvcs for pre-save context [deployment-charts] - https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: AikoChou)
[07:14:28] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T373168
[07:14:31] <stashbot>	 T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:14:34] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T373168
[07:15:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es2039 with weight 0 T373168', diff saved to https://phabricator.wikimedia.org/P67755 and previous config saved to /var/cache/conftool/dbconfig/20240826-071504-arnaudb.json
[07:17:54] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote es2039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1065126 (https://phabricator.wikimedia.org/T373168) (owner: Gerrit maintenance bot)
[07:19:12] <arnaudb>	 !log Starting es7 codfw failover from es2038 to es2039 - T373168
[07:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write T373168', diff saved to https://phabricator.wikimedia.org/P67756 and previous config saved to /var/cache/conftool/dbconfig/20240826-072028-arnaudb.json
[07:20:32] <stashbot>	 T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:21:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'rebalance weights T373168', diff saved to https://phabricator.wikimedia.org/P67757 and previous config saved to /var/cache/conftool/dbconfig/20240826-072119-arnaudb.json
[07:22:11] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] icinga: remove frdb2001 frqueue2001 payments2003 [puppet] - https://gerrit.wikimedia.org/r/1064942 (https://phabricator.wikimedia.org/T373149) (owner: Dwisehaupt)
[07:32:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:34:26] <wikibugs>	 (PS1) Filippo Giunchedi: data-platform: fix deploy tags for stat_host [alerts] - https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046)
[07:35:41] <wikibugs>	 (CR) Filippo Giunchedi: "AlertLintProblem meta-alert signaled that node_load15 is missing from 'analytics' instance" [alerts] - https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: Filippo Giunchedi)
[07:40:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173
[07:40:20] <stashbot>	 T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[07:40:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173
[07:41:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2203 with weight 0 T373173', diff saved to https://phabricator.wikimedia.org/P67758 and previous config saved to /var/cache/conftool/dbconfig/20240826-074113-arnaudb.json
[08:07:35] <wikibugs>	 (CR) Marostegui: "Can you make a comment on top so in the future we know what this exception is for?" [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[08:11:21] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] P:idp Remove old CAS 6.6 hosts. [puppet] - https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: Slyngshede)
[08:13:10] <wikibugs>	 (CR) AOkoth: [C:+2] prometheus: add scrape config for vrts sql exporter [puppet] - https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[08:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 depool', diff saved to https://phabricator.wikimedia.org/P67760 and previous config saved to /var/cache/conftool/dbconfig/20240826-081753-arnaudb.json
[08:22:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Primary switchover s1 node in failure
[08:22:54] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Primary switchover s1 node in failure
[08:25:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:40:18] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp2003.wikimedia.org
[08:41:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:08] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[08:46:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:51] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: remove x509ignoreCN=0 from blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657)
[08:48:15] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:48:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173 - repeat due to T373295
[08:48:45] <stashbot>	 T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[08:48:46] <stashbot>	 T373295: reimage db2176 - https://phabricator.wikimedia.org/T373295
[08:49:12] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173 - repeat due to T373295
[08:49:39] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:49:39] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:49:40] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2003.wikimedia.org
[08:49:59] <arnaudb>	 !log Starting s1 codfw failover from db2212 to db2203 - T373173
[08:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T373173', diff saved to https://phabricator.wikimedia.org/P67762 and previous config saved to /var/cache/conftool/dbconfig/20240826-085048-arnaudb.json
[08:51:38] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote db2203 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1065132 (https://phabricator.wikimedia.org/T373173) (owner: Gerrit maintenance bot)
[08:55:29] <wikibugs>	 (CR) Vgutierrez: prometheus: add script to check TCP MSS clamping value (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[08:56:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'weight db2212 T373173', diff saved to https://phabricator.wikimedia.org/P67763 and previous config saved to /var/cache/conftool/dbconfig/20240826-085621-arnaudb.json
[08:56:25] <stashbot>	 T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[09:01:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:37] <wikibugs>	 (PS1) Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[09:11:21] <wikibugs>	 (CR) CI reject: [V:-1] Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:13:52] <wikibugs>	 (PS2) Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[09:17:50] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1003.wikimedia.org
[09:20:11] <wikibugs>	 (CR) Ayounsi: "Script can be tested over there https://netbox-next.wikimedia.org/extras/scripts/37/" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:21:39] <wikibugs>	 (CR) Ayounsi: "See related task, let me know if you think it would be useful. Otherwise I'd be ok to close the task as declined." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:22:43] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:23:21] <wikibugs>	 (CR) Clément Goubert: [C:+1] scap.cfg.erb: Enable require_tty_multiplexer [puppet] - https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: Ahmon Dancy)
[09:25:59] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:27:04] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:27:04] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:27:05] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1003.wikimedia.org
[09:28:50] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test1003.wikimedia.org
[09:29:27] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:45] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699
[09:32:07] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: Configure caching for production [deployment-charts] - https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[09:32:13] <wikibugs>	 (PS3) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991)
[09:32:25] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699 (owner: Jgiannelos)
[09:32:42] <wikibugs>	 (CR) Arnaudb: "done!" [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:33:14] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Configure caching for production [deployment-charts] - https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[09:33:33] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:33:35] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1064721 (owner: PipelineBot)
[09:33:39] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1063791 (owner: PipelineBot)
[09:33:42] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1062999 (owner: PipelineBot)
[09:33:48] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1061985 (owner: PipelineBot)
[09:33:54] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699 (owner: Jgiannelos)
[09:33:54] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1056927 (owner: PipelineBot)
[09:34:15] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1055912 (owner: PipelineBot)
[09:34:19] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1054328 (owner: PipelineBot)
[09:34:20] <wikibugs>	 (CR) Marostegui: [C:+1] "This is fine to address the noise, but I've added Jaime as CC to see if he wants to address this in some other way too." [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:34:24] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1052723 (owner: PipelineBot)
[09:34:28] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1052102 (owner: PipelineBot)
[09:35:08] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:36:51] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:37:09] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:37:09] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:09] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1003.wikimedia.org
[09:37:10] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Allow the selection of any vlan in provision server script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[09:39:43] <wikibugs>	 (Merged) jenkins-bot: Allow the selection of any vlan in provision server script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[09:40:30] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:42:29] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:42:39] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:43:06] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:45:36] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[09:46:40] <wikibugs>	 (PS2) Cathal Mooney: Add mtr to standard packages for WMF hosts [puppet] - https://gerrit.wikimedia.org/r/1060458
[09:47:20] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[09:51:55] <wikibugs>	 (PS1) Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997)
[09:55:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[09:55:48] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add mtr to standard packages for WMF hosts [puppet] - https://gerrit.wikimedia.org/r/1060458 (owner: Cathal Mooney)
[09:57:22] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3741/co" [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: Slyngshede)
[09:59:55] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1000)
[10:00:56] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:02:22] <wikibugs>	 (PS2) Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997)
[10:19:27] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Use IPs instead of hostname for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711
[10:24:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:08] <wikibugs>	 (CR) Clément Goubert: [C:+1] use shellbox-video globally (adding group2, including commons) [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[10:27:41] <wikibugs>	 (CR) Jgiannelos: "This was generated by:" [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (owner: Jgiannelos)
[10:29:28] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:23] <wikibugs>	 (PS2) Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314)
[10:30:45] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:32:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:54] <Dreamy_Jazz>	 !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:45:15] <wikibugs>	 (PS2) Gmodena: data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768)
[10:46:34] <Dreamy_Jazz>	 !log Started a maximum 6 hr scan on ruwiki for MediaModeration - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:48:27] <wikibugs>	 (PS8) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:50:34] <wikibugs>	 (PS9) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:52:38] <wikibugs>	 (PS10) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[11:02:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate webperf.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:08:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy limit-by-path: reduce bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[11:10:34] <wikibugs>	 (03PS4) 10Hnowlan: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048)
[11:11:08] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] admin: add new ssh key for ngkountas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) (owner: 10Nik Gkountas)
[11:11:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[11:13:57] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[11:16:18] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:16:48] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[11:17:46] <wikibugs>	 (03PS1) 10JMeybohm: eventgate: Offer readinessProbe that does not test kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192)
[11:17:48] <wikibugs>	 (03PS1) 10JMeybohm: eventgate-main: Disable end-to-end readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192)
[11:18:17] <wikibugs>	 (03Merged) 10jenkins-bot: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:18:23] <wikibugs>	 (03CR) 10Hnowlan: "Sorry for the extra trouble, but for future debugging could you add the associated hostnames as comments?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos)
[11:19:27] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:25:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:27:02] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[11:27:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[11:27:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:27:32] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:27:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67766 and previous config saved to /var/cache/conftool/dbconfig/20240826-112739-ladsgroup.json
[11:27:42] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:28:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67767 and previous config saved to /var/cache/conftool/dbconfig/20240826-112847-ladsgroup.json
[11:29:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[11:30:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[11:30:33] <wikibugs>	 (03PS1) 10Ayounsi: Netbox-next: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036)
[11:30:35] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036)
[11:30:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-next: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:31:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: 10Gmodena)
[11:32:21] <wikibugs>	 (03PS2) 10Ayounsi: Netbox-next: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036)
[11:32:21] <wikibugs>	 (03PS2) 10Ayounsi: Netbox: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036)
[11:33:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM! Nicely done, see inline re: dashboard and other than that this is ready to go I think" [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli)
[11:33:53] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Netbox-next: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:34:06] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] "Self-merging as it's netbox-next" [puppet] - 10https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:36:24] <wikibugs>	 (03PS1) 10Slyngshede: Test Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1066723
[11:41:10] <logmsgbot>	 !log hashar@deploy1003 Started deploy [integration/docroot@c3352dd]: build: update mediawiki/mediawiki-codesniffer to 44.0.0 and micromatch to 4.0.8
[11:41:16] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [integration/docroot@c3352dd]: build: update mediawiki/mediawiki-codesniffer to 44.0.0 and micromatch to 4.0.8 (duration: 00m 06s)
[11:43:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Patch LGTM, nicely done! Please note that the 'corto' Debian package will need to be uploaded to apt.w.o before this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall)
[11:43:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P67768 and previous config saved to /var/cache/conftool/dbconfig/20240826-114354-ladsgroup.json
[11:48:03] <wikibugs>	 (03CR) 10Ayounsi: [V:03+1] "tested on netbox-next." [puppet] - 10https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[11:51:12] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi)
[11:52:36] <wikibugs>	 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10091553 (10ayounsi) Deployed on netbox-next and tests seem all good.
[11:53:06] <wikibugs>	 (03Merged) 10jenkins-bot: IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi)
[11:53:34] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan)
[11:53:39] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[11:54:12] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[11:59:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P67769 and previous config saved to /var/cache/conftool/dbconfig/20240826-115901-ladsgroup.json
[11:59:54] <wikibugs>	 (03CR) 10Jgiannelos: "Sure thing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos)
[12:00:45] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:04:15] <wikibugs>	 (03PS3) 10Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[12:04:28] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:05:23] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[12:08:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T373174
[12:08:58] <stashbot>	 T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174
[12:09:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T373174
[12:09:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2129 with weight 0 T373174', diff saved to https://phabricator.wikimedia.org/P67770 and previous config saved to /var/cache/conftool/dbconfig/20240826-120921-arnaudb.json
[12:12:56] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[12:14:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67771 and previous config saved to /var/cache/conftool/dbconfig/20240826-121408-ladsgroup.json
[12:14:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[12:14:12] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:14:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[12:14:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67772 and previous config saved to /var/cache/conftool/dbconfig/20240826-121419-ladsgroup.json
[12:14:30] <wikibugs>	 (03PS3) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314)
[12:15:18] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10091611 (10ayounsi) 05Open→03Resolved Validator deployed.
[12:16:13] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 274607
[12:16:26] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274607
[12:16:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 269115
[12:16:42] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269115
[12:17:05] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61754
[12:17:19] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61754
[12:17:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263903
[12:17:38] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263903
[12:17:48] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268434
[12:18:06] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268434
[12:18:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67773 and previous config saved to /var/cache/conftool/dbconfig/20240826-121828-ladsgroup.json
[12:20:43] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066744
[12:21:35] <godog>	 !log move to /root unused and about to expire cert on puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed/webperf.discovery.wmnet.pem
[12:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:04] <wikibugs>	 (03PS1) 10Marostegui: test-s4: Add two new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1066749
[12:23:50] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] test-s4: Add two new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1066749 (owner: 10Marostegui)
[12:25:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing
[12:25:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing
[12:25:48] <wikibugs>	 (03PS1) 10Slyngshede: MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - 10https://gerrit.wikimedia.org/r/1066750
[12:27:37] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1065133 (https://phabricator.wikimedia.org/T373174) (owner: 10Gerrit maintenance bot)
[12:27:46] <jinxer-wm>	 RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate webperf.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:28:41] <arnaudb>	 !log Starting s6 codfw failover from db2214 to db2129 - T373174
[12:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:45] <stashbot>	 T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174
[12:29:16] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[12:29:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T373174', diff saved to https://phabricator.wikimedia.org/P67774 and previous config saved to /var/cache/conftool/dbconfig/20240826-122925-arnaudb.json
[12:30:01] <wikibugs>	 (03PS4) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314)
[12:31:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: remove x509ignoreCN=0 from blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[12:31:51] <wikibugs>	 (03PS5) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314)
[12:32:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Weight db2214 T373174', diff saved to https://phabricator.wikimedia.org/P67775 and previous config saved to /var/cache/conftool/dbconfig/20240826-123205-arnaudb.json
[12:33:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67776 and previous config saved to /var/cache/conftool/dbconfig/20240826-123336-ladsgroup.json
[12:34:16] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Add db2232 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1066752
[12:34:28] <jinxer-wm>	 RESOLVED: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:34:53] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1066753 (https://phabricator.wikimedia.org/T373330)
[12:35:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Add db2232 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1066752 (owner: 10Marostegui)
[12:43:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091725 (10ABran-WMF) preparation job with the first few critical instances on the path is done for now. I'll have a few host to mo...
[12:43:29] <wikibugs>	 (03PS1) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098)
[12:46:06] <godog>	 jouncebot: now and next
[12:46:06] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[12:48:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67777 and previous config saved to /var/cache/conftool/dbconfig/20240826-124843-ladsgroup.json
[12:48:56] <wikibugs>	 (03CR) 10Brouberol: "Note that this requires that the DAGs are injected at runtime, as the stastd client class is `wmf_airflow_common.metrics.custom_statsd_cli" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol)
[12:56:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091780 (10ABran-WMF) this task depends on: T373175
[12:57:11] <wikibugs>	 (03PS2) 10Hnowlan: shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241)
[12:57:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091785 (10ABran-WMF)
[12:57:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091786 (10ABran-WMF)
[12:59:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10091794 (10ABran-WMF)
[12:59:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091795 (10ABran-WMF)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1300).
[13:00:05] <jouncebot>	 ihurbain and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <urbanecm>	 👋
[13:00:16] <ihurbain>	 o/
[13:00:24] <hnowlan>	 o/
[13:00:46] <hnowlan>	 just fyi: one of my patches is a no-op, it just adds a script to mediawiki-config, and I was a little unsure about the process.
[13:00:50] <urbanecm>	 i can deploy today
[13:01:08] <hnowlan>	 the other will (similar to previous ones) only take effect once it hits prod 
[13:01:08] <ihurbain>	 i need a deployer today, we have synergies then :D
[13:01:21] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan)
[13:02:04] <urbanecm>	 hnowlan: scheduling like you did is fine :). although having a +1 from someone would be nice, given your patch adds config. i'm curious why it is under `scripts` (as opposed to under `rpc`, together with the other one).
[13:02:26] <urbanecm>	 (ah, +1s are in history, just not on the latest PS)
[13:02:37] <hnowlan>	 urbanecm: good question. this script is explicitly *not* an RPC script, it'll only be invoked via shell 
[13:03:10] <urbanecm>	 fair enough, that makes sense. let's do it then.
[13:03:12] <hnowlan>	 at a later point as part of a Kubernetes Job object 
[13:03:20] <hnowlan>	 thanks!
[13:03:23] <wikibugs>	 (03PS2) 10Isabelle Hurbain-Palatin: Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871)
[13:03:26] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:03:34] <wikibugs>	 (03PS5) 10Hnowlan: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048)
[13:03:37] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[13:03:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67778 and previous config saved to /var/cache/conftool/dbconfig/20240826-130350-ladsgroup.json
[13:03:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[13:03:54] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:03:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[13:04:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67779 and previous config saved to /var/cache/conftool/dbconfig/20240826-130401-ladsgroup.json
[13:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan)
[13:05:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67780 and previous config saved to /var/cache/conftool/dbconfig/20240826-130510-ladsgroup.json
[13:05:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:06:08] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:06:27] <wikibugs>	 (03Merged) 10jenkins-bot: Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:06:29] <wikibugs>	 (03Merged) 10jenkins-bot: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[13:06:56] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[13:07:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[13:08:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:08:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan)
[13:09:31] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]]
[13:09:36] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:09:36] <stashbot>	 T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048
[13:10:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: SystemdUnitFailed as warning for data-persitence [puppet] - 10https://gerrit.wikimedia.org/r/1066762 (https://phabricator.wikimedia.org/T357333)
[13:17:49] <urbanecm>	 scap is still scapping :/
[13:20:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67781 and previous config saved to /var/cache/conftool/dbconfig/20240826-132016-ladsgroup.json
[13:21:04] <wikibugs>	 (03PS1) 10Ayounsi: site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770
[13:22:24] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770 (owner: 10Ayounsi)
[13:22:34] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770 (owner: 10Ayounsi)
[13:23:34] <wikibugs>	 (03CR) 10TChin: [C:03+1] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: 10Gmodena)
[13:24:00] <logmsgbot>	 !log urbanecm@deploy1003 hnowlan, urbanecm, ihurbain: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:24:04] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:24:05] <stashbot>	 T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048
[13:24:07] <urbanecm>	 Finally
[13:24:08] <ihurbain>	 aha
[13:24:11] <urbanecm>	 ihurbain: can you test, please?
[13:24:15] <ihurbain>	 yup, doing that
[13:24:35] <urbanecm>	 hnowlan: i presume your patch can go ahead right away, unless you want to do something at mwdebug while it's there?
[13:24:51] <urbanecm>	 (the script one)
[13:26:08] <hnowlan>	 urbanecm: nope, go ahead thanks
[13:26:12] <urbanecm>	 will do
[13:27:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[13:27:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[13:27:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67782 and previous config saved to /var/cache/conftool/dbconfig/20240826-132738-ladsgroup.json
[13:27:42] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[13:28:00] <ihurbain>	 urbanecm: ship it
[13:28:29] <logmsgbot>	 !log urbanecm@deploy1003 hnowlan, urbanecm, ihurbain: Continuing with sync
[13:28:33] <urbanecm>	 syncing!
[13:28:37] <ihurbain>	 woot!
[13:29:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host rpki2003.codfw.wmnet
[13:29:38] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2003.codfw.wmnet
[13:30:45] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host rpki2003.codfw.wmnet
[13:30:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[13:30:48] <godog>	 urbanecm: hi, would you mind pinging me once you are done with the deployment? thank you!
[13:31:00] <urbanecm>	 godog: hello! no problem, will do
[13:31:09] <godog>	 cheers
[13:31:37] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066772
[13:32:06] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066773
[13:34:00] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM rpki2003.codfw.wmnet - ayounsi@cumin1002"
[13:34:04] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM rpki2003.codfw.wmnet - ayounsi@cumin1002"
[13:34:04] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:34:04] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache rpki2003.codfw.wmnet on all recursors
[13:34:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) rpki2003.codfw.wmnet on all recursors
[13:34:26] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and A:cp for 9.2.5-1wm2
[13:34:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM rpki2003.codfw.wmnet - ayounsi@cumin1002"
[13:34:40] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM rpki2003.codfw.wmnet - ayounsi@cumin1002"
[13:35:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67783 and previous config saved to /var/cache/conftool/dbconfig/20240826-133524-ladsgroup.json
[13:35:34] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host rpki2003.codfw.wmnet with OS bookworm
[13:36:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10091977 (Clement_Goubert) Open→In progress
[13:36:25] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]] (duration: 26m 53s)
[13:36:28] <urbanecm>	 finally
[13:36:29] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:36:29] <stashbot>	 T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048
[13:36:32] <wikibugs>	 (PS3) Hnowlan: use shellbox-video globally (adding group2, including commons) [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241)
[13:36:35] <wikibugs>	 (CR) Urbanecm: [C:+2] use shellbox-video globally (adding group2, including commons) [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[13:36:37] <ihurbain>	 thank you urbanecm !
[13:36:39] <urbanecm>	 now the last one :)
[13:36:43] <urbanecm>	 no problem ihurbain 
[13:36:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[13:37:22] <wikibugs>	 (Merged) jenkins-bot: use shellbox-video globally (adding group2, including commons) [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[13:37:33] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]]
[13:37:37] <stashbot>	 T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241
[13:40:19] <logmsgbot>	 !log urbanecm@deploy1003 hnowlan, urbanecm: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:40:29] <urbanecm>	 hnowlan: can you test, please?
[13:40:57] <hnowlan>	 urbanecm: there's no testing possible on the test servers unfortunately, this needs to go to the prod jobrunners 
[13:41:09] <urbanecm>	 ah, i see. so, we need to go ahead and see?
[13:41:14] <hnowlan>	 yep, afraid so :D 
[13:41:17] <logmsgbot>	 !log urbanecm@deploy1003 hnowlan, urbanecm: Continuing with sync
[13:41:20] <urbanecm>	 let's see then :D
[13:41:21] <hnowlan>	 this is reasonably well understood, only concern is capacity 
[13:41:34] <hnowlan>	 famous last words, on both accounts
[13:41:45] * urbanecm notes to mass-upload tons of videos shortly after finishing the window
[13:42:01] <hnowlan>	 😅
[13:45:35] <Dreamy_Jazz>	 !log Started 6hr maximum scan on nowiki - https://wikitech.wikimedia.org/wiki/MediaModeration
[13:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:38] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] (duration: 08m 04s)
[13:45:41] <stashbot>	 T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241
[13:45:48] <urbanecm>	 hnowlan: well, it's out :)
[13:45:51] <urbanecm>	 anything else?
[13:46:24] <hnowlan>	 urbanecm: that's all for me, thank you! 
[13:46:29] <urbanecm>	 no problem!
[13:46:41] <urbanecm>	 godog: i'm done. not sure if hnowlan wants a while to monitor the impact of the last change.
[13:47:41] <godog>	 urbanecm: thank you! appreciate it, I'll check the traffic here and proceed in case
[13:49:10] <wikibugs>	 (PS1) Daimona Eaytoy: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041)
[13:49:11] <hnowlan>	 on my end for now I think it's just a matter of observation and possibly prayer 
[13:49:14] <wikibugs>	 (CR) Clément Goubert: "See inline, this would only apply to bare-metal and not mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/1049625 (https://phabricator.wikimedia.org/T356814) (owner: Cwhite)
[13:50:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67784 and previous config saved to /var/cache/conftool/dbconfig/20240826-135031-ladsgroup.json
[13:50:33] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[13:50:35] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:50:46] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[13:50:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67785 and previous config saved to /var/cache/conftool/dbconfig/20240826-135052-ladsgroup.json
[13:50:59] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: Daimona Eaytoy)
[13:51:03] <wikibugs>	 (CR) Gmodena: [C:+2] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: Gmodena)
[13:51:20] <claime>	 jouncebot: nowandnext
[13:51:20] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1300)
[13:51:20] <jouncebot>	 In 1 hour(s) and 38 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530)
[13:51:57] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2013.codfw.wmnet
[13:52:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2013.codfw.wmnet
[13:52:51] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on rpki2003.codfw.wmnet with reason: host reimage
[13:52:53] <wikibugs>	 (Merged) jenkins-bot: data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: Gmodena)
[13:53:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67786 and previous config saved to /var/cache/conftool/dbconfig/20240826-135301-ladsgroup.json
[13:53:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bullseye
[13:53:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[13:53:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7fc4fbcc0d30>
[13:53:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[13:55:50] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rpki2003.codfw.wmnet with reason: host reimage
[13:56:10] <godog>	 ok thank you, I'll proceed with prometheus esams bookworm upgrade
[13:56:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2013 - cgoubert@cumin1002"
[13:59:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2013 - cgoubert@cumin1002"
[13:59:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:59:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2013.codfw.wmnet 68.0.192.10.in-addr.arpa 8.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:59:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2013.codfw.wmnet 68.0.192.10.in-addr.arpa 8.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:59:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013
[13:59:30] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013
[13:59:30] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7fc4fbcc0d30>
[14:00:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2004.codfw.wmnet
[14:00:29] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2004.codfw.wmnet
[14:00:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2004.codfw.wmnet
[14:01:11] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2004.codfw.wmnet
[14:02:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2004.codfw.wmnet with OS bullseye
[14:02:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:03:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f13a466bd60>
[14:03:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:06:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2004 - cgoubert@cumin1002"
[14:06:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2004 - cgoubert@cumin1002"
[14:06:28] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:06:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2004.codfw.wmnet 178.16.192.10.in-addr.arpa 8.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:06:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2004.codfw.wmnet 178.16.192.10.in-addr.arpa 8.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:06:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2034.codfw.wmnet
[14:06:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2004
[14:06:40] <godog>	 !log start prometheus3003 bookworm upgrade - T326657
[14:06:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2004
[14:06:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f13a466bd60>
[14:07:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2034.codfw.wmnet
[14:07:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2034.codfw.wmnet with OS bullseye
[14:08:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67787 and previous config saved to /var/cache/conftool/dbconfig/20240826-140808-ladsgroup.json
[14:08:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092103 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:08:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f61e7d94d00>
[14:10:14] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092109 (ssingh) @KCVelaga_WMF: https://phabricator.wikimedia.org/legalpad/signatures/3/query/mfpOg6TDIwDU/#R indicates that you have signed an older version of L3. Ca...
[14:10:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:11:01] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092111 (ssingh)
[14:12:14] <wikibugs>	 (PS1) Jelto: profile::firewall::nftables_throttling: fix issue of global metering [puppet] - https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882)
[14:14:15] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2034 - cgoubert@cumin1002"
[14:14:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2034 - cgoubert@cumin1002"
[14:14:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:14:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2034.codfw.wmnet 57.0.192.10.in-addr.arpa 7.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:14:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2034.codfw.wmnet 57.0.192.10.in-addr.arpa 7.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:14:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2034
[14:14:34] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet
[14:14:34] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2034
[14:14:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f61e7d94d00>
[14:15:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2008.codfw.wmnet
[14:15:36] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3742/co" [puppet] - https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:16:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2008.codfw.wmnet
[14:16:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[14:16:44] <wikibugs>	 (PS1) David Caro: alerts: add toolsadmin probe [puppet] - https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250)
[14:16:46] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2008.codfw.wmnet with OS bullseye
[14:17:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092168 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:17:02] <wikibugs>	 (CR) Jelto: [V:+1] "more context in https://phabricator.wikimedia.org/T365259#10092085 and https://wiki.nftables.org/wiki-nftables/index.php/Meters" [puppet] - https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:17:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f33b17ddd90>
[14:17:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:17:52] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10092176 (ssingh) @KFrancis: Hi! Checking the spreadsheet, it seems like we will need an NDA for @NCreasy. Thanks as always.
[14:19:03] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Can be merged at any time" [puppet] - https://gerrit.wikimedia.org/r/1064820 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[14:19:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[14:19:42] <wikibugs>	 SRE-Access-Requests, Gerrit-Privilege-Requests, LDAP-Access-Requests, Security-Team, Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092189 (ssingh)
[14:20:36] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet
[14:20:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2008 - cgoubert@cumin1002"
[14:20:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2008 - cgoubert@cumin1002"
[14:20:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:20:45] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2008.codfw.wmnet 196.16.192.10.in-addr.arpa 6.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:20:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2008.codfw.wmnet 196.16.192.10.in-addr.arpa 6.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:20:49] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2008
[14:21:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2008
[14:21:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f33b17ddd90>
[14:21:10] <wikibugs>	 SRE-Access-Requests, Gerrit-Privilege-Requests, LDAP-Access-Requests, Security-Team, Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092195 (ssingh)
[14:21:53] <claime>	 !log Running homer 'cr*codfw*' commit 'T372878'
[14:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:57] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:22:21] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10092203 (ssingh)
[14:23:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67788 and previous config saved to /var/cache/conftool/dbconfig/20240826-142315-ladsgroup.json
[14:23:20] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3744/console" [puppet] - https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: Dzahn)
[14:23:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage
[14:24:28] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:24:35] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#10092204 (jhathaway) Open→Resolved a:jhathaway @Xover, I am going to assume this is no longer occurring, please reopen, if it occurs...
[14:24:47] <wikibugs>	 SRE-Access-Requests, Gerrit-Privilege-Requests, LDAP-Access-Requests, Security-Team, Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092202 (ssingh) @jhathaway: Can you please confirm from I/F side as part...
[14:25:36] <wikibugs>	 (CR) Hashar: [C:-1] "This will get Puppet to install OpenJDK 17 on the hosts however:" [puppet] - https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn)
[14:25:46] <wikibugs>	 (CR) Filippo Giunchedi: "With the latest PSes in place https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064820 no longer changes 'alertmanagers' which I think" [puppet] - https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[14:26:03] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1066785
[14:26:10] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM when the time comes" [dns] - https://gerrit.wikimedia.org/r/1065258 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[14:26:43] <wikibugs>	 (CR) Ebernhardson: search: use mul fallback for manually-tuned search profiles (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: DCausse)
[14:26:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage
[14:27:14] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1066786
[14:27:47] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Juniper alarms (instance cr1-eqiad) - https://phabricator.wikimedia.org/T373166#10092233 (ayounsi) →Duplicate dup:T372781
[14:27:58] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10092235 (ayounsi)
[14:30:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[14:31:03] <wikibugs>	 (CR) Jelto: [V:+1 C:+1] "looks mostly good, one comment about the metric names in line" [puppet] - https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: Dzahn)
[14:31:28] <wikibugs>	 (CR) Ebernhardson: "Private wikis are also now running in SUP, the cirrus load on the job queue still remains for some small use cases (wikitech, hopefully be" [deployment-charts] - https://gerrit.wikimedia.org/r/1035732 (owner: DCausse)
[14:32:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[14:34:28] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:36:35] <Dreamy_Jazz>	 !log Started 6hr maximum scan on group2 - https://wikitech.wikimedia.org/wiki/MediaModeration
[14:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage
[14:38:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67789 and previous config saved to /var/cache/conftool/dbconfig/20240826-143822-ladsgroup.json
[14:38:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:38:27] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:38:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:38:42] <wikibugs>	 (CR) Jgiannelos: "I updated the patch with both the IPs and the hostnames per node." [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: Jgiannelos)
[14:38:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67790 and previous config saved to /var/cache/conftool/dbconfig/20240826-143844-ladsgroup.json
[14:39:28] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bullseye
[14:39:48] <wikibugs>	 (CR) Ebernhardson: [C:-1] "private wikis are now supported in SUP, the only remaining wiki is wikitech. Progress is underway in T292707 to bring wikitech into kubern" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052135 (owner: DCausse)
[14:39:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092313 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:40:22] <claime>	 !log homer 'lsw1-a5-codfw*' commit 'T372878'
[14:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:25] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:40:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage
[14:41:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2013.codfw.wmnet
[14:41:34] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2013.codfw.wmnet
[14:41:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67791 and previous config saved to /var/cache/conftool/dbconfig/20240826-144153-ladsgroup.json
[14:41:56] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2014.codfw.wmnet
[14:42:29] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2014.codfw.wmnet
[14:43:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:44:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092327 (10KCVelaga_WMF) @ssingh When I visit L3, it shows `You signed this document on Oct 11 2021, 6:29 PM.` I don't have any option to un-sign the older version and s...
[14:44:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bullseye
[14:44:38] <wikibugs>	 07Puppet, 06Infrastructure-Foundations, 06Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980#10092329 (10joanna_borun) p:05Triage→03Low
[14:44:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092328 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:44:45] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.100.0" for 211 hosts
[14:44:53] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f15affd4d00>
[14:45:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:45:29] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.100.0" completed for 211 hosts
[14:46:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2004.codfw.wmnet with OS bullseye
[14:46:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092348 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:46:38] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "Scap 4.100.0 (which uses this setting) has been deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy)
[14:47:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:47:13] <claime>	 !log homer 'lsw1-b3-codfw*' commit T372878
[14:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:17] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:48:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 07Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#10092353 (10joanna_borun) p:05High→03Medium
[14:49:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2014 - cgoubert@cumin1002"
[14:49:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2014 - cgoubert@cumin1002"
[14:49:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:49:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2014.codfw.wmnet 70.0.192.10.in-addr.arpa 0.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:49:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2014.codfw.wmnet 70.0.192.10.in-addr.arpa 0.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:49:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2014
[14:49:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2004.codfw.wmnet
[14:49:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2004.codfw.wmnet
[14:49:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2014
[14:49:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f15affd4d00>
[14:50:42] <claime>	 !log Running homer 'cr*codfw*' commit 'T372878'
[14:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10092360 (10elukey) p:05High→03Medium Left to do:  * Make sure the new conftool package is deployed on all puppetserver no...
[14:53:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: exim should log the reason for defer with disconnect after HELO/EHLO - https://phabricator.wikimedia.org/T265142#10092368 (10jhathaway) 05Open→03Declined We have moved to Postfix for ingress and egress, so declining.
[14:54:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2034.codfw.wmnet with OS bullseye
[14:54:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:54:48] <claime>	 !log homer 'lsw-a3-codfw*' commit T372878
[14:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:52] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:55:04] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10092374 (10joanna_borun) p:05Medium→03Low
[14:55:12] <claime>	 !log homer 'lsw1-a3-codfw*' commit T372878
[14:55:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603#10092376 (10jhathaway) @Base is this issue still ongoing?
[14:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:00] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: CAS Single Logout Flow - https://phabricator.wikimedia.org/T233941#10092382 (10SLyngshede-WMF) a:03SLyngshede-WMF
[14:56:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2034.codfw.wmnet
[14:56:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2034.codfw.wmnet
[14:57:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67792 and previous config saved to /var/cache/conftool/dbconfig/20240826-145700-ladsgroup.json
[14:57:16] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942#10092389 (10SLyngshede-WMF) p:05Medium→03Low a:03SLyngshede-WMF
[14:57:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092395 (10ssingh) >>! In T373194#10092327, @KCVelaga_WMF wrote: > @ssingh When I visit L3, it shows `You signed this document on Oct 11 2021, 6:29 PM.` I don't have any...
[14:58:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail, 10Observability-Alerting: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016#10092392 (10jhathaway) 05Open→03Declined Since we have migrated to Postfix, and Postfix doesn't have a panic log, declining.
[14:58:50] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10092397 (10elukey)
[14:59:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092402 (10ssingh)
[15:00:12] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2008.codfw.wmnet with OS bullseye
[15:00:24] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092408 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:00:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:54] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "Ack, Does the fact that a follow-up change is needed make this a -1 though?" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[15:01:59] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092417 (10ssingh) For posterity: approving manager and actual manager are the same in this case.
[15:02:13] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rpki2003.codfw.wmnet with OS bookworm
[15:02:13] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki2003.codfw.wmnet
[15:02:23] <claime>	 !log homer 'lsw1-b6-codfw*' commit T372878
[15:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:26] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[15:03:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2008.codfw.wmnet
[15:03:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2008.codfw.wmnet
[15:04:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243#10092422 (10joanna_borun) 05Open→03Declined
[15:04:28] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:39] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Better detection for "reboot into PXE failed" conditions in wmf-auto-reimage - https://phabricator.wikimedia.org/T261956#10092436 (10joanna_borun) 05Open→03Declined
[15:06:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[15:08:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[15:08:52] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793
[15:09:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott)
[15:10:42] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793
[15:11:13] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 06Infrastructure-Foundations: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788#10092464 (10SLyngshede-WMF) a:03SLyngshede-WMF
[15:12:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67793 and previous config saved to /var/cache/conftool/dbconfig/20240826-151207-ladsgroup.json
[15:14:42] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10092513 (10ayounsi) a:03ayounsi
[15:15:17] <wikibugs>	 06SRE, 10SRE-tools, 10Icinga, 06Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447#10092518 (10joanna_borun) 05Open→03Resolved
[15:17:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524#10092537 (10joanna_borun) 05Open→03Declined
[15:22:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989#10092583 (10jhathaway) 05Open→03Resolved a:03jhathaway We assume you are now using debian's package, please re-open if something else is needed.
[15:23:25] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott)
[15:27:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67794 and previous config saved to /var/cache/conftool/dbconfig/20240826-152715-ladsgroup.json
[15:27:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:27:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:27:41] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:28:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bullseye
[15:28:39] <claime>	 !log homer 'lsw1-a5-codfw*' commit 'T372878'
[15:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:42] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[15:29:05] <wikibugs>	 (03CR) 10David Caro: [C:03+1] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott)
[15:29:16] <wikibugs>	 (03PS1) 10Ayounsi: Ganeti test/routed: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560)
[15:29:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott)
[15:29:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet
[15:29:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2014.codfw.wmnet
[15:29:52] <elukey>	 jouncebot: next
[15:29:52] <jouncebot>	 In 0 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530)
[15:29:52] <wikibugs>	 (03PS2) 10Ayounsi: Ganeti test/routed: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560)
[15:30:05] <jouncebot>	 jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530)
[15:30:32] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[15:32:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:33:50] <wikibugs>	 (03CR) 10Herron: [C:03+1] udp2log: tag logrotated mwlogs with yesterdays date [puppet] - 10https://gerrit.wikimedia.org/r/984228 (https://phabricator.wikimedia.org/T353221) (owner: 10Cwhite)
[15:33:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[15:34:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[15:34:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67795 and previous config saved to /var/cache/conftool/dbconfig/20240826-153415-ladsgroup.json
[15:34:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:34:47] <wikibugs>	 (03CR) 10Herron: [C:03+1] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[15:35:46] <wikibugs>	 (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1063233 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse)
[15:36:20] <wikibugs>	 (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse)
[15:36:35] <wikibugs>	 (03CR) 10Herron: [C:03+1] alert: Update alertmanager tests hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[15:36:40] <wikibugs>	 (03PS1) 10Ssingh: admin: add kcvelaga to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1066803 (https://phabricator.wikimedia.org/T373194)
[15:37:15] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 (https://phabricator.wikimedia.org/T128546)
[15:37:50] <jan_drewniak>	 !log starting Wikimedia Portals Update. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1066804
[15:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add kcvelaga to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1066803 (https://phabricator.wikimedia.org/T373194) (owner: 10Ssingh)
[15:40:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67796 and previous config saved to /var/cache/conftool/dbconfig/20240826-154000-ladsgroup.json
[15:40:04] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:40:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67797 and previous config saved to /var/cache/conftool/dbconfig/20240826-154024-ladsgroup.json
[15:40:28] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:41:12] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:41:56] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:42:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092760 (10ssingh) 05Open→03Resolved a:03ssingh Request merged, please try in ~30 mins and if it doesn't work, please re-open this task....
[15:42:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] "Seems good although it would be nice to tie to the wmcs team (which I think is not yet possible)" [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: 10David Caro)
[15:43:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Enable require_tty_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy)
[15:43:20] <mutante>	 jouncebot: now
[15:43:20] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530)
[15:44:56] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10092785 (10Gehel) p:05Triage→03High
[15:47:15] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and A:cp for 9.2.5-1wm2
[15:47:44] <sukhe>	 !log finished upgrading A:cp-eqsin to ATS 9.2.5: T339134
[15:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:47] <stashbot>	 T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134
[15:52:36] <wikibugs>	 (03PS3) 10Dzahn: prometheus: create text file export for nft throttling denylist length [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136)
[15:53:10] <wikibugs>	 (03CR) 10Dzahn: prometheus: create text file export for nft throttling denylist length (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[15:55:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67798 and previous config saved to /var/cache/conftool/dbconfig/20240826-155507-ladsgroup.json
[15:55:20] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 09m 39s)
[15:55:31] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:55:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67799 and previous config saved to /var/cache/conftool/dbconfig/20240826-155531-ladsgroup.json
[15:56:26] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[15:56:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2001.codfw.wmnet
[15:57:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2035.codfw.wmnet
[15:57:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2001.codfw.wmnet
[15:57:35] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 02m 14s)
[15:57:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2035.codfw.wmnet
[16:01:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye
[16:01:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:01:41] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2001.codfw.wmnet with OS bullseye
[16:01:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[16:02:13] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack policies: open up some more read-only endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1066808
[16:02:13] <wikibugs>	 (03PS1) 10Andrew Bogott: prometheus-openstack-exporter: Use novaobserver rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1066809
[16:03:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2001.codfw.wmnet
[16:03:59] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2001.codfw.wmnet
[16:04:56] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2035.codfw.wmnet with OS bullseye
[16:05:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:05:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f6bc9767d90>
[16:06:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[16:07:59] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10092887 (10ssingh) a:05Fabfur→03None
[16:10:01] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2035 - cgoubert@cumin1002"
[16:10:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2035 - cgoubert@cumin1002"
[16:10:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:10:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2035.codfw.wmnet 62.16.192.10.in-addr.arpa 2.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:10:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2035.codfw.wmnet 62.16.192.10.in-addr.arpa 2.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:10:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2035
[16:10:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67800 and previous config saved to /var/cache/conftool/dbconfig/20240826-161015-ladsgroup.json
[16:10:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2035
[16:10:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f6bc9767d90>
[16:10:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67801 and previous config saved to /var/cache/conftool/dbconfig/20240826-161039-ladsgroup.json
[16:10:41] <wikibugs>	 (03PS2) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789)
[16:11:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[16:13:41] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: testing
[16:13:52] <logmsgbot>	 !log dancy@deploy1003 Stopping before sync operations
[16:16:40] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap.cfg.erb: Enable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066810 (https://phabricator.wikimedia.org/T361724)
[16:20:10] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Enable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066810 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy)
[16:25:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67802 and previous config saved to /var/cache/conftool/dbconfig/20240826-162522-ladsgroup.json
[16:25:24] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[16:25:35] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[16:25:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[16:25:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T371742)', diff saved to https://phabricator.wikimedia.org/P67803 and previous config saved to /var/cache/conftool/dbconfig/20240826-162544-ladsgroup.json
[16:25:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67804 and previous config saved to /var/cache/conftool/dbconfig/20240826-162553-ladsgroup.json
[16:25:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[16:25:58] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:26:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[16:28:35] <claime>	 !log homer 'cr*codfw*' commit 'T372878'
[16:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:38] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[16:29:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage
[16:29:54] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[16:30:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[16:30:13] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:30:25] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:30:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67805 and previous config saved to /var/cache/conftool/dbconfig/20240826-163032-ladsgroup.json
[16:32:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage
[16:35:48] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093010 (10Mstyles) Hey! I'm from the security team and I didn't see either...
[16:37:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10093030 (10Jhancock.wm) 05Open→03Resolved
[16:37:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67806 and previous config saved to /var/cache/conftool/dbconfig/20240826-163728-ladsgroup.json
[16:37:32] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:41:12] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10093041 (10Dwisehaupt)
[16:41:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack policies: open up some more read-only endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1066808 (owner: 10Andrew Bogott)
[16:46:23] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM, let's try it." [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[16:47:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10093051 (10VRiley-WMF) ml-serve1009 Rack A2 U19 CableID 4897 Port 7  ml-serve1010 Rack E5 U3...
[16:51:41] <wikibugs>	 (03CR) 10Anzx: "namespace also needed to be updated on `wgMetaNamespace` in `wmf-config/core-Namespaces.php`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux)
[16:52:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67807 and previous config saved to /var/cache/conftool/dbconfig/20240826-165235-ladsgroup.json
[16:52:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2035.codfw.wmnet with OS bullseye
[16:52:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[16:52:56] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "The node_exporter pages has a comment that says: "To version 8.4", so this might not work on Bookworm which ships with drdb-utils version " [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[16:53:24] <claime>	 !log homer 'lsw1-b8-codfw*' commit T372878
[16:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:28] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[16:53:32] <wikibugs>	 (03CR) 10Andrew Bogott: "A few cinder endpoints explicitly check the admin flag and fail after this change. It still feels better to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1066809 (owner: 10Andrew Bogott)
[16:53:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] prometheus-openstack-exporter: Use novaobserver rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1066809 (owner: 10Andrew Bogott)
[16:54:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2035.codfw.wmnet
[16:54:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2035.codfw.wmnet
[16:55:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:00:10] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[17:00:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:53] <wikibugs>	 (03Abandoned) 10Ryan Kemper: wdqs: create wdqs split pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) (owner: 10Stevemunene)
[17:04:12] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[17:07:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67808 and previous config saved to /var/cache/conftool/dbconfig/20240826-170742-ladsgroup.json
[17:07:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10093179 (10VRiley-WMF)
[17:08:03] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[17:11:15] <wikibugs>	 (03CR) 10Dzahn: "Apparently this change meant that now we can't change the JAVA version without coordinating changes in both puppet repo and deployment re" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[17:13:32] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093216 (10Aklapper) > Perhaps they've already been removed by someone else?...
[17:13:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "based on your previous +1, and since I addressed the comment, i'll go ahead. open to further fixes of course" [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[17:16:44] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368)
[17:16:44] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145)
[17:16:44] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368)
[17:16:45] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368)
[17:18:25] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert^2 "wdqs graph split: routing for wdqs backends" [puppet] - 10https://gerrit.wikimedia.org/r/1066812
[17:19:13] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[17:19:30] <zabe>	 Dreamy_Jazz: does the concept of global groups make sense for temporary accounts?
[17:21:08] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[17:22:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67809 and previous config saved to /var/cache/conftool/dbconfig/20240826-172250-ladsgroup.json
[17:22:54] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:22:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[17:23:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[17:23:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on 11 hosts with reason: Maintenance
[17:23:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 11 hosts with reason: Maintenance
[17:27:57] <wikibugs>	 (03PS10) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364)
[17:28:06] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364)
[17:32:20] <wikibugs>	 (03CR) 10Scott French: [C:03+2] eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[17:33:36] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[17:37:02] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper)
[17:39:03] <ryankemper>	 !log T364364 Created PTR & A records for new graph split services `wdqs-main` and `wdqs-scholarly` (merged https://gerrit.wikimedia.org/r/c/operations/dns/+/1051446 and ran `sudo authdns-update` on `dns1004.wikimedia.org`)
[17:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:06] <stashbot>	 T364364: Provision DNS and certificates for wdqs graph split domains  - https://phabricator.wikimedia.org/T364364
[17:39:49] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper)
[17:39:55] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[17:39:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2018.codfw.wmnet
[17:40:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[17:40:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2018.codfw.wmnet
[17:41:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[17:41:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[17:43:18] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-scholarly
[17:43:27] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-main
[17:50:47] <wikibugs>	 (03PS1) 10Kamila Součková: kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878)
[17:51:47] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[17:52:04] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[17:52:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[17:52:46] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145)
[17:52:46] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368)
[17:52:46] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368)
[17:52:47] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[17:52:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[17:53:10] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper)
[17:53:37] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[17:55:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379 (10jwang) 03NEW
[17:59:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093472 (10Dzahn) Hi @jwang   the request from the past you are referencing is for a different type of access.  That was for the " Requested group membership: analytics-privatedata-users, researchers...
[18:00:09] <wikibugs>	 (03CR) 10Ssingh: "A:lvs-low-traffic-eqiad or A:lvs-low-traffic-codfw." [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper)
[18:02:00] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145)
[18:02:04] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper)
[18:02:39] <wikibugs>	 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093482 (10Dzahn) Ideally if you could use this form for access requests please:  https://phabricator.wikimedia.org/maniphest/task/edit/form/8/  Since you already made this one you can also just copy...
[18:02:49] <wikibugs>	 (03PS6) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145)
[18:02:49] <wikibugs>	 (03PS6) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368)
[18:02:49] <wikibugs>	 (03PS6) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368)
[18:02:50] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[18:05:37] <wikibugs>	 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093490 (10ssingh) Thanks @Dzahn!  @jwang: Happy to take care of the request once you file the task and the approvals are in.
[18:08:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[18:09:35] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[18:09:46] <wikibugs>	 (03PS7) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945)
[18:09:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[18:11:26] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[18:14:05] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance
[18:14:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance
[18:14:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67810 and previous config saved to /var/cache/conftool/dbconfig/20240826-181414-ladsgroup.json
[18:14:18] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:16:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67811 and previous config saved to /var/cache/conftool/dbconfig/20240826-181624-ladsgroup.json
[18:25:59] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Add the alert[12]002 hosts as Icinga and AM partners [puppet] - 10https://gerrit.wikimedia.org/r/1064820 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[18:31:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67812 and previous config saved to /var/cache/conftool/dbconfig/20240826-183131-ladsgroup.json
[18:32:20] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper)
[18:33:08] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM! Not sure what a great `Hosts` selector might be for something like this ... maybe the worker and control-plane roles? (mainly as a r" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[18:35:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093604 (10jwang)
[18:35:52] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:36:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:36:38] <sukhe>	 what happened to puppetmaster
[18:36:51] <sukhe>	 weird, it looks fine
[18:37:26] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap.cfg.erb: Disable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066821 (https://phabricator.wikimedia.org/T361724)
[18:38:01] <sukhe>	 alerting for a few days apparently?
[18:39:43] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper)
[18:40:24] <sukhe>	 I am thinking now this is a stale alert
[18:43:02] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002
[18:43:17] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Thanks for the reply. Looks like the issue that's annoying me is still there on idm-test. Here's a new version of this patch that should f" [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński)
[18:44:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T371742)', diff saved to https://phabricator.wikimedia.org/P67813 and previous config saved to /var/cache/conftool/dbconfig/20240826-184441-ladsgroup.json
[18:44:46] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[18:45:21] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Disable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066821 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy)
[18:46:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67814 and previous config saved to /var/cache/conftool/dbconfig/20240826-184638-ladsgroup.json
[18:47:33] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:49:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:50:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for  jwang - https://phabricator.wikimedia.org/T373379#10093667 (10jwang)
[18:50:48] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1023*
[18:51:14] <wikibugs>	 (03PS7) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368)
[18:56:52] <wikibugs>	 (03PS8) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368)
[18:56:52] <wikibugs>	 (03PS7) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368)
[18:56:52] <wikibugs>	 (03PS6) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[18:56:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for  jwang - https://phabricator.wikimedia.org/T373379#10093700 (10jwang) SSH public key {F57295594}
[18:59:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67815 and previous config saved to /var/cache/conftool/dbconfig/20240826-185948-ladsgroup.json
[19:01:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67816 and previous config saved to /var/cache/conftool/dbconfig/20240826-190145-ladsgroup.json
[19:01:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[19:01:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[19:01:51] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:01:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:01:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[19:02:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67817 and previous config saved to /var/cache/conftool/dbconfig/20240826-190201-ladsgroup.json
[19:03:00] <wikibugs>	 (03PS2) 10Kamila Součková: kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878)
[19:03:15] <wikibugs>	 (03CR) 10Kamila Součková: "I'm not sure either, going to just leave it out I guess. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[19:03:37] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper)
[19:03:56] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[19:04:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for  jwang - https://phabricator.wikimedia.org/T373379#10093710 (10ssingh) @thcipriani: this requires your approval, thanks! @mpopov (approving manager) already added.
[19:04:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67818 and previous config saved to /var/cache/conftool/dbconfig/20240826-190411-ladsgroup.json
[19:05:16] <wikibugs>	 (03PS1) 10Ssingh: admin: add jiawang to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1066833 (https://phabricator.wikimedia.org/T373379)
[19:05:42] <ryankemper>	 kamila_: looks like we merged patches at the same time, I went ahead and puppet-merged both of ours jfyi
[19:05:51] <ryankemper>	 !log T280001 Disabled puppet on all lvs hosts in preparation for rolling restart
[19:05:54] <kamila_>	 thanks ryankemper
[19:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:54] <stashbot>	 T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001
[19:05:56] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for  jwang - https://phabricator.wikimedia.org/T373379#10093735 (10ssingh)
[19:06:47] <ryankemper>	 !log T280001 [eqiad] enabled puppet on eqiad lvs hosts, expecting alerts soon
[19:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2018 to wikikube-worker2041
[19:07:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[19:07:33] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:08:07] <wikibugs>	 (03PS1) 10Dzahn: gerrit/prometheus: create profile for new nft throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136)
[19:11:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2018 to wikikube-worker2041 - kamila@cumin1002"
[19:11:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2018 to wikikube-worker2041 - kamila@cumin1002"
[19:11:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:11:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2041
[19:12:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2041
[19:12:54] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2018 to wikikube-worker2041
[19:13:13] <ryankemper>	 !log T280001 [eqiad] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'`
[19:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:17] <stashbot>	 T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001
[19:13:18] <ryankemper>	 !log T280001 [eqiad] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding
[19:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[19:14:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2041.codfw.wmnet with OS bullseye
[19:14:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f58347ff5e0>
[19:14:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[19:14:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67819 and previous config saved to /var/cache/conftool/dbconfig/20240826-191456-ladsgroup.json
[19:15:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[19:16:09] <ryankemper>	 !log T280001 [eqiad] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'`
[19:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:14] <ryankemper>	 !log T280001 [eqiad] `sudo ipvsadm -L -n` on lvs primary looks good, proceeding
[19:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67820 and previous config saved to /var/cache/conftool/dbconfig/20240826-191917-ladsgroup.json
[19:20:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2041 - kamila@cumin1002"
[19:20:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2041 - kamila@cumin1002"
[19:20:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:20:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2041.codfw.wmnet 125.0.192.10.in-addr.arpa 5.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:20:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2041.codfw.wmnet 125.0.192.10.in-addr.arpa 5.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:20:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2041
[19:20:34] <ryankemper>	 !log T280001 [codfw] ran puppet on codfw lvs hosts, expecting alerts soon
[19:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:37] <stashbot>	 T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001
[19:20:54] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2041
[19:20:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f58347ff5e0>
[19:20:56] <ryankemper>	 !log T280001 [codfw] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'`
[19:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:20] <sukhe>	 icinga-wm: help
[19:23:43] <ryankemper>	 !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding
[19:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:46] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[19:24:05] <ryankemper>	 !log T364368 [codfw] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'`
[19:24:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:42] <sukhe>	 !log sukhe@alert1001:~$ sudo systemctl restart ircecho.service
[19:24:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:00] <sukhe>	 icinga-wm: help
[19:25:21] <sukhe>	 alerts on Icinga but not here
[19:25:21] <sukhe>	 hmm
[19:25:23] <ryankemper>	 !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts
[19:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:36] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[19:27:40] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:27:44] <sukhe>	 ah ok it's back
[19:27:46] <wikibugs>	 (03PS8) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368)
[19:27:50] <sukhe>	 mutante: ^ back
[19:27:55] <wikibugs>	 (03PS7) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367)
[19:28:14] <mutante>	 sukhe: I tried to send "custom notification" but as always logged in with the wrong user :)
[19:28:46] <sukhe>	 :)
[19:30:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T371742)', diff saved to https://phabricator.wikimedia.org/P67821 and previous config saved to /var/cache/conftool/dbconfig/20240826-193003-ladsgroup.json
[19:30:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[19:30:11] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[19:30:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[19:30:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:30:25] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:30:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T371742)', diff saved to https://phabricator.wikimedia.org/P67822 and previous config saved to /var/cache/conftool/dbconfig/20240826-193032-ladsgroup.json
[19:31:30] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:31:36] <sukhe>	 ohhh
[19:34:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67823 and previous config saved to /var/cache/conftool/dbconfig/20240826-193425-ladsgroup.json
[19:36:04] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper)
[19:36:23] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage
[19:36:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:09] <wikibugs>	 (03PS2) 10Dzahn: gerrit/prometheus: create profile for new nft throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136)
[19:39:30] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:39:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage
[19:40:09] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1066834/3747/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[19:41:02] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093910 (10Mstyles) @Aklapper perhaps they never had security access to begi...
[19:42:27] <ryankemper>	 !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts
[19:42:29] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093917 (10ssingh) I don't see a history of them having being added either....
[19:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:32] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[19:42:35] <ryankemper>	 oops, wrong log message
[19:43:15] <ryankemper>	 !log T364368 Merged patch to move lvs state to `production` for `wdqs-main` and `wdqs-scholarly` (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064848) and ran puppet on all LVS hosts
[19:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:43:54] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper)
[19:44:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:44:43] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093923 (10jhathaway) >>! In T372767#10092202, @ssingh wrote: > @jhathaway:...
[19:45:09] <ryankemper>	 !log T364368 Merged patch to add dns discovery resources for `wdqs-main` and `wdqs-scholarly` (https://gerrit.wikimedia.org/r/c/operations/dns/+/1064831), and ran puppet on all DNS hosts
[19:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:31] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093930 (10ssingh)
[19:47:56] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:48:18] <ryankemper>	 !log T364368 Manually adding dns discovery resources to etcd corresponding to https://wikitech.wikimedia.org/wiki/LVS#Add_the_DNS_Discovery_Record
[19:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:22] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[19:48:28] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-main
[19:48:32] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly
[19:49:02] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) (owner: 10Ryan Kemper)
[19:49:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67824 and previous config saved to /var/cache/conftool/dbconfig/20240826-194933-ladsgroup.json
[19:49:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[19:49:37] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:49:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[19:49:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67825 and previous config saved to /var/cache/conftool/dbconfig/20240826-194944-ladsgroup.json
[19:50:23] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093931 (10ssingh) 05Open→03Resolved a:03ssingh
[19:51:30] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:51:45] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10093947 (10ssingh) Hi @ngkountas: It seems like the new key uploaded is also the same one that was being used in WMCS. Please generate a new key independent of WMCS an...
[19:51:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67826 and previous config saved to /var/cache/conftool/dbconfig/20240826-195153-ladsgroup.json
[19:52:49] <sukhe>	 ryankemper: 
[19:52:55] <sukhe>	 Aug 26 19:51:26 dns1004 confd[1821905]: 2024-08-26T19:51:26Z dns1004 /usr/bin/confd[1821905]: ERROR 100: Key not found (/conftool/v1/discovery/wdqs-main) [3293405]
[19:52:56] <jinxer-wm>	 FIRING: [18x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:52:58] <sukhe>	 Aug 26 19:51:26 dns1004 confd[1821905]: 2024-08-26T19:51:26Z dns1004 /usr/bin/confd[1821905]: ERROR 100: Key not found (/conftool/v1/discovery/wdqs-scholarly) [3293405]
[19:53:21] <sukhe>	 you definitely had a patch for adding the new services to conftool-data/discovery/services.yaml
[19:53:24] <sukhe>	 that should be merged
[19:54:15] <sukhe>	 +wdqs-main: [eqiad, codfw]
[19:54:15] <sukhe>	 +wdqs-scholarly: [eqiad, codfw]
[19:54:26] <sukhe>	 like this 
[19:55:27] <ryankemper>	 sukhe: oh, thanks for catching that. looks like we forgot to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064479
[19:55:43] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper)
[19:55:44] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper)
[19:55:49] <sukhe>	 indeed! 
[19:56:27] <ryankemper>	 sukhe: once puppet has run on dns servers should that resolve the alert? or do we have to do any special steps to unstick things
[19:56:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:57:06] <sukhe>	 ryankemper: so after merging on master, we should just check that the keys in etcd have been created
[19:57:11] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[19:57:17] <sukhe>	 and the cleanup will involve removing some stale error files (but you can leave that to me)
[19:57:24] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:57:25] <sukhe>	 so let's first merge on puppetmaster and see
[19:57:56] <jinxer-wm>	 FIRING: [32x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:58:02] <ryankemper>	 sukhe: merge done. I already kicked off a puppet run on 'A:dnsbox', sounds like that wasn't necessary though
[19:58:13] <sukhe>	 yeah, that should not affect anything there but no worries
[19:58:32] <sukhe>	 sukhe@cumin1002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/discovery/wdqs-scholarly
[19:58:35] <sukhe>	 /conftool/v1/discovery/wdqs-scholarly/eqiad
[19:58:38] <sukhe>	 /conftool/v1/discovery/wdqs-scholarly/codfw
[19:58:40] <sukhe>	 looks good
[19:58:44] <ryankemper>	 excellent
[19:59:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2041.codfw.wmnet with OS bullseye
[19:59:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub...
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2000).
[20:00:04] <jouncebot>	 RoanKattouw, dbrant, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:12] <RoanKattouw>	 I'll deploy today
[20:00:21] <dbrant>	 o/
[20:02:56] <jinxer-wm>	 RESOLVED: [32x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:03:26] <sukhe>	 ryankemper: resolved
[20:03:56] <ryankemper>	 sukhe: thanks for all your help! (and the rest of the traffic team) that should be it for us today
[20:04:09] <sukhe>	 hth :) (thanks to brett) 
[20:04:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant)
[20:04:42] <ryankemper>	 yes ty brett!
[20:04:51] <ryankemper>	 technically waiting on puppet runs on the cp* servers to do a proper end to end test, but besides that we're all done here
[20:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: Turn account vanishing contact form into a redirect. (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant)
[20:05:09] <sukhe>	 yeah it's a good idea to do that
[20:05:10] <kamila_>	 !log run homer to add wikikube-worker2041 T372878
[20:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:14] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[20:05:51] <RoanKattouw>	 dbrant: Yours is done, beta might take some time to update though (shouldn't be longer than 10-15 minutes)
[20:05:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2041.codfw.wmnet
[20:05:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2041.codfw.wmnet
[20:06:08] <RoanKattouw>	 cscott: Are you here for your ParserMigration deployment?
[20:06:14] <dbrant>	 thx!
[20:06:59] <aude>	 I'm here to help test the Chart extension on beta
[20:07:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67827 and previous config saved to /var/cache/conftool/dbconfig/20240826-200701-ladsgroup.json
[20:08:09] <RoanKattouw>	 I'll just proceed with the Chart change for now. It'll take a while to deploy in production due to the i18n rebuild that will be required
[20:08:18] <wikibugs>	 (03PS8) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945)
[20:08:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope)
[20:08:33] <wikibugs>	 (03CR) 10Bking: [V:03+1] "Thanks for catching this! I was wondering why it didn't go off ;)" [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi)
[20:08:36] <wikibugs>	 (03CR) 10Bking: [V:03+1 C:03+2] data-platform: fix deploy tags for stat_host [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi)
[20:09:03] <wikibugs>	 (03Merged) 10jenkins-bot: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope)
[20:09:12] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]]
[20:09:19] <stashbot>	 T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945
[20:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: data-platform: fix deploy tags for stat_host [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi)
[20:19:17] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit/prometheus: create profile for new nft throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[20:20:51] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly
[20:21:02] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-main
[20:22:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67829 and previous config saved to /var/cache/conftool/dbconfig/20240826-202208-ladsgroup.json
[20:24:40] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[20:24:59] <aude>	 Original exception: [ZszkepHm38NxrrUhj5S1oQAAAAM] /wiki/Special:Version UnexpectedValueException: Error: invalid magic word 'chart' (do we need to wait a bit for i18n stuff)?
[20:25:00] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10094052 (10KFrancis) Hello @NCreasy please send your email address and postal address to kfrancis@wikimedia.org and I'll get the agreement out to you to sign.  Thanks!
[20:25:39] <aude>	 ok the error is gone
[20:26:48] <RoanKattouw>	 Yeah beta is still mid-deployment
[20:27:59] <logmsgbot>	 !log catrope@deploy1003 catrope: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:28:04] <stashbot>	 T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945
[20:28:57] <logmsgbot>	 !log catrope@deploy1003 catrope: Continuing with sync
[20:31:58] <RoanKattouw>	 aude: Alright beta deployment is done, we can start testing now
[20:32:11] <jinxer-wm>	 RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[20:32:21] <RoanKattouw>	 I'll start by setting up a data page and a chart page on beta commons
[20:33:30] <RoanKattouw>	 Oh you beat me to it lol
[20:34:50] <RoanKattouw>	 Looks like it's all working!
[20:36:56] <RoanKattouw>	 dbrant: Your patch should now be in beta too, my charts deployment delayed yours a bit
[20:37:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67831 and previous config saved to /var/cache/conftool/dbconfig/20240826-203715-ladsgroup.json
[20:37:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[20:37:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:37:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[20:37:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67832 and previous config saved to /var/cache/conftool/dbconfig/20240826-203726-ladsgroup.json
[20:37:47] <dbrant>	 RoanKattouw: looks good, thanks!
[20:39:10] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]] (duration: 29m 57s)
[20:39:13] <stashbot>	 T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945
[20:39:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67833 and previous config saved to /var/cache/conftool/dbconfig/20240826-203936-ladsgroup.json
[20:43:00] <cscott>	 RoanKattouw: sorry I spaced on the timing, but if you're still around I'm game to deploy my patch.
[20:43:39] <wikibugs>	 (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878)
[20:43:42] <RoanKattouw>	 Starting it now
[20:43:47] <wikibugs>	 (03PS3) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789)
[20:43:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:44:38] <wikibugs>	 (03Merged) 10jenkins-bot: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:44:49] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1064963|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]]
[20:44:53] <stashbot>	 T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789
[20:47:42] <logmsgbot>	 !log catrope@deploy1003 catrope, cscott: Backport for [[gerrit:1064963|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:51:24] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French)
[20:51:30] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2025.codfw.wmnet
[20:52:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Remove obsolete files for openstack v. antelope [puppet] - 10https://gerrit.wikimedia.org/r/1065235 (owner: 10Andrew Bogott)
[20:52:06] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2025.codfw.wmnet
[20:52:38] <RoanKattouw>	 cscott: Is this testable on the test servers, or should I just proceed?
[20:54:04] <wikibugs>	 (03CR) 10Scott French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French)
[20:54:33] <wikibugs>	 (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French)
[20:54:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67834 and previous config saved to /var/cache/conftool/dbconfig/20240826-205443-ladsgroup.json
[20:54:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for  jwang - https://phabricator.wikimedia.org/T373379#10094161 (10jwang) Hi @ssingh, my manager @mpopov is on PTO in the following two weeks. Can I ask his manager for approval, or is there someone else I should ask?
[20:55:27] <cscott>	 RoanKattouw: it's testable, give me a second.
[20:55:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10094163 (10Reedy)
[20:56:52] <cscott>	 oh crap, some of the CSS needed isn't going to be in place until Wednesday's train :(
[20:57:14] <cscott>	 https://en.wikivoyage.org/wiki/Coimbra the parsoid indicator top right is floating too far above the baseline :(
[20:57:40] <cscott>	 i forgot i needed that deployed
[20:58:02] <cscott>	 RoanKattouw: i'm afraid we should probably back that out, and i'll redo it on Wednesday after the train.
[20:58:17] <RoanKattouw>	 OK will do
[20:58:19] <logmsgbot>	 !log catrope@deploy1003 Sync cancelled.
[20:58:28] <cscott>	 or else I could backport the needed CSS but it's too late in the window for that i think.
[20:58:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094168 (10jwang)
[20:58:42] <cscott>	 RoanKattouw: sorry about that.
[20:59:10] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881
[20:59:10] <wikibugs>	 (03CR) 10TrainBranchBot: "catrope@deploy1003 created a revert of this change as I4fbffad1102c3290c98bdfa355c5b412a473cf3f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:59:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881 (owner: 10TrainBranchBot)
[21:00:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881 (owner: 10TrainBranchBot)
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2100).
[21:00:28] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]]
[21:00:41] <wikibugs>	 (03PS1) 10C. Scott Ananian: Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789)
[21:01:51] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2025 to wikikube-worker2042
[21:02:14] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:02:30] <logmsgbot>	 !log catrope@deploy1003 catrope, trainbranchbot: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:04:16] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094179 (10ssingh) >>! In T373379#10094161, @jwang wrote: > Hi @ssingh, my manager @mpopov is on PTO in the following two weeks. Can I ask his manager for approval,...
[21:04:37] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10094180 (10KFrancis) Hello all, I am confirming as @NCreasy is a contractor with the WMF, there is already an NDA in place.  Thanks!
[21:05:11] <wikibugs>	 (03Abandoned) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott)
[21:07:31] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2025 to wikikube-worker2042 - swfrench@cumin2002"
[21:08:29] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2025 to wikikube-worker2042 - swfrench@cumin2002"
[21:08:29] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:08:31] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2042
[21:08:57] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2042
[21:09:37] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2025 to wikikube-worker2042
[21:09:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67835 and previous config saved to /var/cache/conftool/dbconfig/20240826-210951-ladsgroup.json
[21:09:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094187 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:16:27] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2042.codfw.wmnet on all recursors
[21:16:30] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2042.codfw.wmnet on all recursors
[21:17:11] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2042.codfw.wmnet with OS bullseye
[21:17:23] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7fe3cb9c1700>
[21:17:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094193 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[21:18:33] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:22:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094199 (10jwang) @kzimmerman, can you approve it while @mpopov is out?
[21:23:02] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2042 - swfrench@cumin2002"
[21:23:07] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2042 - swfrench@cumin2002"
[21:23:08] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:23:08] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2042.codfw.wmnet 20.0.192.10.in-addr.arpa 0.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:23:11] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2042.codfw.wmnet 20.0.192.10.in-addr.arpa 0.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:23:12] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2042
[21:24:24] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:24:26] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2042
[21:24:26] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7fe3cb9c1700>
[21:24:34] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:25:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67836 and previous config saved to /var/cache/conftool/dbconfig/20240826-212458-ladsgroup.json
[21:25:04] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance
[21:25:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance
[21:25:07] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:25:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T370903)', diff saved to https://phabricator.wikimedia.org/P67837 and previous config saved to /var/cache/conftool/dbconfig/20240826-212513-ladsgroup.json
[21:25:14] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52483 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:25:24] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:27:19] <logmsgbot>	 !log catrope@deploy1003 catrope, trainbranchbot: Continuing with sync
[21:27:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T370903)', diff saved to https://phabricator.wikimedia.org/P67838 and previous config saved to /var/cache/conftool/dbconfig/20240826-212723-ladsgroup.json
[21:30:30] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899
[21:31:50] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] (duration: 31m 21s)
[21:38:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T371742)', diff saved to https://phabricator.wikimedia.org/P67839 and previous config saved to /var/cache/conftool/dbconfig/20240826-213807-ladsgroup.json
[21:38:11] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[21:41:37] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage
[21:42:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67840 and previous config saved to /var/cache/conftool/dbconfig/20240826-214230-ladsgroup.json
[21:44:29] <wikibugs>	 (03CR) 10JHathaway: "@ffurnari@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway)
[21:44:53] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094270 (10kzimmerman) Approved as Mikhail's manager!  (Mikhail has mentioned the needs to deploy Airflow pipelines. Let me know if other questions come up that I c...
[21:45:00] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage
[21:45:41] <wikibugs>	 (03CR) 10JHathaway: "though pcc shows the metaparams being removed, in my testing the metaparams are still taken into account when applying, they are just no l" [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway)
[21:47:56] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: logging: Use '??=' operator to reduce repetition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902
[21:49:07] <wikibugs>	 (03PS1) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T366431)
[21:49:29] <wikibugs>	 (03PS2) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T366431)
[21:50:22] <wikibugs>	 (03PS3) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247)
[21:53:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67841 and previous config saved to /var/cache/conftool/dbconfig/20240826-215314-ladsgroup.json
[21:57:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67842 and previous config saved to /var/cache/conftool/dbconfig/20240826-215738-ladsgroup.json
[22:00:13] <wikibugs>	 (03PS2) 10Superpes15: [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264)
[22:00:21] <zabe>	 jouncebot: nowandnext
[22:00:21] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2100)
[22:00:21] <jouncebot>	 In 3 hour(s) and 59 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0200)
[22:00:44] <wikibugs>	 (03CR) 10Zabe: [C:03+2] [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) (owner: 10Superpes15)
[22:00:59] <wikibugs>	 (03PS3) 10Zabe: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15)
[22:01:09] <inflatador>	 !log bking@dns1004.wikimedia.org `sudo -i authdns-update` T364364
[22:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:16] <stashbot>	 T364364: Provision DNS and certificates for wdqs graph split domains  - https://phabricator.wikimedia.org/T364364
[22:01:29] <wikibugs>	 (03Merged) 10jenkins-bot: [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) (owner: 10Superpes15)
[22:01:36] <wikibugs>	 (03PS4) 10Zabe: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15)
[22:01:38] <wikibugs>	 (03CR) 10Zabe: [C:03+2] [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15)
[22:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15)
[22:02:44] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1051757|[sysop_plwiki] Change the logo/icon and the favicon (T368712)]], [[gerrit:1052063|[arbcom_itwiki] Enable importing from itwiki (T369264)]]
[22:02:50] <stashbot>	 T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712
[22:02:50] <stashbot>	 T369264: Enable importing from itwiki on arbcom_itwiki - https://phabricator.wikimedia.org/T369264
[22:04:48] <logmsgbot>	 !log zabe@deploy1003 superpes, zabe: Backport for [[gerrit:1051757|[sysop_plwiki] Change the logo/icon and the favicon (T368712)]], [[gerrit:1052063|[arbcom_itwiki] Enable importing from itwiki (T369264)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:05:21] <logmsgbot>	 !log zabe@deploy1003 superpes, zabe: Continuing with sync
[22:05:38] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2042.codfw.wmnet with OS bullseye
[22:05:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094339 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:07:14] <wikibugs>	 (03PS6) 10Superpes15: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683)
[22:07:15] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) (owner: 10Superpes15)
[22:07:58] <wikibugs>	 (03Merged) 10jenkins-bot: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) (owner: 10Superpes15)
[22:08:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67843 and previous config saved to /var/cache/conftool/dbconfig/20240826-220821-ladsgroup.json
[22:30:46] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:31:52] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 473, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:33:40] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:33:40] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:33:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:36:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:36:30] <swfrench-wmf>	 !log running homer 'cr*codfw*' commit 'T372878' (remove old BGP session config for kubernetes2018, kubernetes2025)
[22:36:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:34] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[22:36:40] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:36:40] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:38:05] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:39:00] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 555, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:39:26] <wikibugs>	 (03PS1) 10Jasmine_: admin: renamed jfk to jasmine [puppet] - 10https://gerrit.wikimedia.org/r/1066909
[22:39:40] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:40:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: renamed jfk to jasmine [puppet] - 10https://gerrit.wikimedia.org/r/1066909 (owner: 10Jasmine_)
[22:41:09] <sukhe>	 this is a fun one. what changed
[22:41:41] <swfrench-wmf>	 sukhe: the ntp alerts, or the BGP ones that just resolved?
[22:42:18] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:42:51] <sukhe>	 NTP ones, looking
[22:44:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P67850 and previous config saved to /var/cache/conftool/dbconfig/20240826-224426-ladsgroup.json
[22:48:41] <sukhe>	 ah I see it
[22:48:51] <sukhe>	 the alert hosts changed
[22:49:00] <sukhe>	 the other question I have is why didn't this alert before
[22:49:10] <sukhe>	 but that's for later I guess, we should restart ntp.service. running the cookbook
[22:49:33] <wikibugs>	 (03PS2) 10Jasmine_: admin: renamed jfk to jasmine [puppet] - 10https://gerrit.wikimedia.org/r/1066909
[22:51:08] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox
[22:51:26] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:51:34] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:51:34] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:51:34] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:51:58] <sukhe>	 ok recoveries should come but this takes some time as we have a grace sleep of 15 minutes between hosts for the NTP sync
[22:52:11] <sukhe>	 nothing to worry about here as such, and it should not affect anything else
[22:52:29] <swfrench-wmf>	 thanks, sukhe!
[22:52:34] <sukhe>	 thanks <3
[22:52:38] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] admin: renamed jfk to jasmine [puppet] - 10https://gerrit.wikimedia.org/r/1066909 (owner: 10Jasmine_)
[22:59:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T370903)', diff saved to https://phabricator.wikimedia.org/P67851 and previous config saved to /var/cache/conftool/dbconfig/20240826-225933-ladsgroup.json
[22:59:39] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:00:26] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:01:04] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:01:44] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:02:45] <sukhe>	 ^ should clear up as the cookbook progresses
[23:06:54] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:07:22] <icinga-wm>	 PROBLEM - Host kubernetes2018 is DOWN: PING CRITICAL - Packet loss = 100%
[23:14:16] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:23:42] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:31:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:39:18] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:54:46] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[23:57:11] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections