[00:07:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1066131 (owner: TrainBranchBot)
[02:17:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:27] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:39:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:26] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:16:00] (PS1) Marostegui: installserver: Do not reimage db2229 [puppet] - https://gerrit.wikimedia.org/r/1066431
[05:19:05] (CR) Marostegui: [C:+2] installserver: Do not reimage db2229 [puppet] - https://gerrit.wikimedia.org/r/1066431 (owner: Marostegui)
[05:29:08] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10090462 (Joe) For the record, the reason we wanted to support large file uploads was not to worsen the performance of upload-by-url, which has since been fixed by makin...
[05:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:31] (CR) Jgiannelos: [C:+1] Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: Eevans)
[06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:36] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 32934
[06:11:50] (PS1) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451
[06:11:54] (CR) Arnaudb: "this is to avoid noisy alerting while backups are performed." [alerts] - https://gerrit.wikimedia.org/r/1066451 (owner: Arnaudb)
[06:16:14] (CR) Slyngshede: "Sorry, I got distracted. Merged patches are automatically deployed to idm-test.wikimedia.org, for testing, and only later released as part" [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:16:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934
[06:19:29] (PS2) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991)
[06:20:57] SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10090530 (ABran-WMF) p:High→Med...
[06:21:02] SRE, DBA, serviceops, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10090532 (ABran-WMF) p:Medium→H...
[06:29:28] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:59:37] (PS3) Wangombe: Update reference to ElasticSearchTtmServer [mediawiki-config] - https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342)
[07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T0700). nyaa~
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:07:49] (CR) Kevin Bazira: [C:+1] ml-services: add new revertrisk isvcs for pre-save context [deployment-charts] - https://gerrit.wikimedia.org/r/1065221 (https://phabricator.wikimedia.org/T356102) (owner: AikoChou)
[07:14:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T373168
[07:14:31] T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:14:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T373168
[07:15:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es2039 with weight 0 T373168', diff saved to https://phabricator.wikimedia.org/P67755 and previous config saved to /var/cache/conftool/dbconfig/20240826-071504-arnaudb.json
[07:17:54] (CR) Arnaudb: [C:+2] mariadb: Promote es2039 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1065126 (https://phabricator.wikimedia.org/T373168) (owner: Gerrit maintenance bot)
[07:19:12] !log Starting es7 codfw failover from es2038 to es2039 - T373168
[07:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write T373168', diff saved to https://phabricator.wikimedia.org/P67756 and previous config saved to /var/cache/conftool/dbconfig/20240826-072028-arnaudb.json
[07:20:32] T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:21:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'rebalance weights T373168', diff saved to https://phabricator.wikimedia.org/P67757 and previous config saved to /var/cache/conftool/dbconfig/20240826-072119-arnaudb.json
[07:22:11] (CR) Filippo Giunchedi: [C:+2] icinga: remove frdb2001 frqueue2001 payments2003 [puppet] - https://gerrit.wikimedia.org/r/1064942 (https://phabricator.wikimedia.org/T373149) (owner: Dwisehaupt)
[07:32:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:34:26] (PS1) Filippo Giunchedi: data-platform: fix deploy tags for stat_host [alerts] - https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046)
[07:35:41] (CR) Filippo Giunchedi: "AlertLintProblem meta-alert signaled that node_load15 is missing from 'analytics' instance" [alerts] - https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: Filippo Giunchedi)
[07:40:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173
[07:40:20] T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[07:40:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173
[07:41:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2203 with weight 0 T373173', diff saved to https://phabricator.wikimedia.org/P67758 and previous config saved to /var/cache/conftool/dbconfig/20240826-074113-arnaudb.json
[08:07:35] (CR) Marostegui: "Can you make a comment on top so in the future we know what this exception is for?" [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[08:11:21] (CR) Slyngshede: [V:+1 C:+2] P:idp Remove old CAS 6.6 hosts. [puppet] - https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: Slyngshede)
[08:13:10] (CR) AOkoth: [C:+2] prometheus: add scrape config for vrts sql exporter [puppet] - https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[08:15:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 depool', diff saved to https://phabricator.wikimedia.org/P67760 and previous config saved to /var/cache/conftool/dbconfig/20240826-081753-arnaudb.json
[08:22:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Primary switchover s1 node in failure
[08:22:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Primary switchover s1 node in failure
[08:25:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:40:18] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp2003.wikimedia.org
[08:41:25] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:08] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[08:46:25] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:51] (PS1) Filippo Giunchedi: prometheus: remove x509ignoreCN=0 from blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657)
[08:48:15] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:48:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173 - repeat due to T373295
[08:48:45] T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[08:48:46] T373295: reimage db2176 - https://phabricator.wikimedia.org/T373295
[08:49:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T373173 - repeat due to T373295
[08:49:39] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[08:49:39] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:49:40] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2003.wikimedia.org
[08:49:59] !log Starting s1 codfw failover from db2212 to db2203 - T373173
[08:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T373173', diff saved to https://phabricator.wikimedia.org/P67762 and previous config saved to /var/cache/conftool/dbconfig/20240826-085048-arnaudb.json
[08:51:38] (CR) Arnaudb: [C:+2] mariadb: Promote db2203 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1065132 (https://phabricator.wikimedia.org/T373173) (owner: Gerrit maintenance bot)
[08:55:29] (CR) Vgutierrez: prometheus: add script to check TCP MSS clamping value (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[08:56:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'weight db2212 T373173', diff saved to https://phabricator.wikimedia.org/P67763 and previous config saved to /var/cache/conftool/dbconfig/20240826-085621-arnaudb.json
[08:56:25] T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[09:01:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:37] (PS1) Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[09:11:21] (CR) CI reject: [V:-1] Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:13:52] (PS2) Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[09:17:50] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp1003.wikimedia.org
[09:20:11] (CR) Ayounsi: "Script can be tested over there https://netbox-next.wikimedia.org/extras/scripts/37/" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:21:39] (CR) Ayounsi: "See related task, let me know if you think it would be useful. Otherwise I'd be ok to close the task as declined." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[09:22:43] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:23:21] (CR) Clément Goubert: [C:+1] scap.cfg.erb: Enable require_tty_multiplexer [puppet] - https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: Ahmon Dancy)
[09:25:59] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:27:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:27:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:27:05] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1003.wikimedia.org
[09:28:50] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test1003.wikimedia.org
[09:29:27] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:45] (PS1) Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699
[09:32:07] (CR) Jgiannelos: [C:+2] mobileapps: Configure caching for production [deployment-charts] - https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[09:32:13] (PS3) Arnaudb: mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991)
[09:32:25] (CR) Jgiannelos: [C:+2] mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699 (owner: Jgiannelos)
[09:32:42] (CR) Arnaudb: "done!" [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:33:14] (Merged) jenkins-bot: mobileapps: Configure caching for production [deployment-charts] - https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[09:33:33] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[09:33:35] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1064721 (owner: PipelineBot)
[09:33:39] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1063791 (owner: PipelineBot)
[09:33:42] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1062999 (owner: PipelineBot)
[09:33:48] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1061985 (owner: PipelineBot)
[09:33:54] (Merged) jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1066699 (owner: Jgiannelos)
[09:33:54] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1056927 (owner: PipelineBot)
[09:34:15] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1055912 (owner: PipelineBot)
[09:34:19] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1054328 (owner: PipelineBot)
[09:34:20] (CR) Marostegui: [C:+1] "This is fine to address the noise, but I've added Jaime as CC to see if he wants to address this in some other way too." [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:34:24] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1052723 (owner: PipelineBot)
[09:34:28] (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1052102 (owner: PipelineBot)
[09:35:08] (CR) Arnaudb: [C:+2] mariadb: adjust warning threshold by excluding backup instances [alerts] - https://gerrit.wikimedia.org/r/1066451 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[09:36:51] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:37:09] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002"
[09:37:09] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:09] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1003.wikimedia.org
[09:37:10] (CR) Cathal Mooney: [C:+2] Allow the selection of any vlan in provision server script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[09:39:43] (Merged) jenkins-bot: Allow the selection of any vlan in provision server script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[09:40:30] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:42:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:42:39] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:43:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:45:36] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[09:46:40] (PS2) Cathal Mooney: Add mtr to standard packages for WMF hosts [puppet] - https://gerrit.wikimedia.org/r/1060458
[09:47:20] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[09:51:55] (PS1) Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997)
[09:55:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[09:55:48] (CR) Cathal Mooney: [C:+2] Add mtr to standard packages for WMF hosts [puppet] - https://gerrit.wikimedia.org/r/1060458 (owner: Cathal Mooney)
[09:57:22] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3741/co" [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: Slyngshede)
[09:59:55] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1000)
[10:00:56] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:02:22] (PS2) Slyngshede: P:idp Clean up CAS 6.6 and Tomcat 9 [puppet] - https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997)
[10:19:27] (PS1) Jgiannelos: mobileapps: Use IPs instead of hostname for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711
[10:24:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:08] (CR) Clément Goubert: [C:+1] use shellbox-video globally (adding group2, including commons) [mediawiki-config] - https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[10:27:41] (CR) Jgiannelos: "This was generated by:" [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (owner: Jgiannelos)
[10:29:28] FIRING: [16x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:23] (PS2) Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314)
[10:30:45] FIRING: [16x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:32:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:54] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:45:15] (PS2) Gmodena: data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768)
[10:46:34] !log Started a maximum 6 hr scan on ruwiki for MediaModeration - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:48:27] (PS8) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:50:34] (PS9) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:52:38] (PS10) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[11:02:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate webperf.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:08:38] (CR) Vgutierrez: [C:+2] haproxy limit-by-path: reduce bwlim [puppet] - https://gerrit.wikimedia.org/r/1065240 (https://phabricator.wikimedia.org/T317799) (owner: CDanis)
[11:10:34] (PS4) Hnowlan: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048)
[11:11:08] (CR) Fabfur: [C:-1] admin: add new ssh key for ngkountas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) (owner: Nik Gkountas)
[11:11:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: Hnowlan)
[11:13:57] !log vgutierrez@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[11:16:18] (CR) Ayounsi: [C:+2] Add devicetype validator [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[11:16:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[11:17:46] (PS1) JMeybohm: eventgate: Offer readinessProbe that does not test kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192)
[11:17:48] (PS1) JMeybohm: eventgate-main: Disable end-to-end readinessProbe [deployment-charts] - https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192)
[11:18:17] (Merged) jenkins-bot: Add devicetype validator [software/netbox-extras] - https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[11:18:23] (CR) Hnowlan: "Sorry for the extra trouble, but for future debugging could you add the associated hostnames as comments?" [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: Jgiannelos)
[11:19:27] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:25:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:27:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[11:27:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[11:27:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:27:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:27:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67766 and previous config saved to /var/cache/conftool/dbconfig/20240826-112739-ladsgroup.json
[11:27:42] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:28:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67767 and previous config saved to /var/cache/conftool/dbconfig/20240826-112847-ladsgroup.json
[11:29:59] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[11:30:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[11:30:33] (PS1) Ayounsi: Netbox-next: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036)
[11:30:35] (PS1) Ayounsi: Netbox: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036)
[11:30:58] (CR) CI reject: [V:-1] Netbox-next: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[11:31:29] (CR) Filippo Giunchedi: [C:+1] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: Gmodena)
[11:32:21] (PS2) Ayounsi: Netbox-next: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036)
[11:32:21] (PS2) Ayounsi: Netbox: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036)
[11:33:40] (CR) Filippo Giunchedi: "LGTM! Nicely done, see inline re: dashboard and other than that this is ready to go I think" [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: Tiziano Fogli)
[11:33:53] (CR) Ayounsi: [C:+2] Netbox-next: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[11:34:06] (CR) Ayounsi: [C:+2] "Self-merging as it's netbox-next" [puppet] - https://gerrit.wikimedia.org/r/1066721 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[11:36:24] (PS1) Slyngshede: Test Account blocking [software/bitu] - https://gerrit.wikimedia.org/r/1066723
[11:41:10] !log hashar@deploy1003 Started deploy [integration/docroot@c3352dd]: build: update mediawiki/mediawiki-codesniffer to 44.0.0 and micromatch to 4.0.8
[11:41:16] !log hashar@deploy1003 Finished deploy [integration/docroot@c3352dd]: build: update mediawiki/mediawiki-codesniffer to 44.0.0 and micromatch to 4.0.8 (duration: 00m 06s)
[11:43:17] (CR) Filippo Giunchedi: [C:+1] "Patch LGTM, nicely done! Please note that the 'corto' Debian package will need to be uploaded to apt.w.o before this is merged" [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[11:43:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P67768 and previous config saved to /var/cache/conftool/dbconfig/20240826-114354-ladsgroup.json
[11:48:03] (CR) Ayounsi: [V:+1] "tested on netbox-next."
[puppet] - 10https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi) [11:51:12] (03CR) 10Ayounsi: [C:03+2] IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi) [11:52:36] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10091553 (10ayounsi) Deployed on netbox-next and tests seem all good. [11:53:06] (03Merged) 10jenkins-bot: IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi) [11:53:34] (03CR) 10Kamila Součková: [C:03+1] shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:53:39] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:54:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:59:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P67769 and previous config saved to /var/cache/conftool/dbconfig/20240826-115901-ladsgroup.json [11:59:54] (03CR) 10Jgiannelos: "Sure thing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos) [12:00:45] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:04:15] (03PS3) 10Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) [12:04:28] FIRING: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:05:23] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:08:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T373174 [12:08:58] T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174 [12:09:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T373174 [12:09:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2129 with weight 0 T373174', diff saved to https://phabricator.wikimedia.org/P67770 and previous config saved to /var/cache/conftool/dbconfig/20240826-120921-arnaudb.json [12:12:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:14:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T370903)', diff saved to https://phabricator.wikimedia.org/P67771 and previous config saved to /var/cache/conftool/dbconfig/20240826-121408-ladsgroup.json [12:14:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:14:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:14:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:14:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67772 and previous config saved to /var/cache/conftool/dbconfig/20240826-121419-ladsgroup.json [12:14:30] (03PS3) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) [12:15:18] 10SRE-tools, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10091611 (10ayounsi) 05Open→03Resolved Validator deployed. [12:16:13] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 274607 [12:16:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274607 [12:16:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 269115 [12:16:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269115 [12:17:05] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61754 [12:17:19] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61754 [12:17:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263903 [12:17:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263903 [12:17:48] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268434 [12:18:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268434 [12:18:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67773 and previous config saved to /var/cache/conftool/dbconfig/20240826-121828-ladsgroup.json [12:20:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066744 [12:21:35] !log move to /root unused and about to expire cert on puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed/webperf.discovery.wmnet.pem [12:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:04] (03PS1) 10Marostegui: test-s4: Add two new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1066749 [12:23:50] (03CR) 10Marostegui: [C:03+2] test-s4: Add two new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1066749 (owner: 10Marostegui) [12:25:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing [12:25:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: Testing [12:25:48] (03PS1) 10Slyngshede: MediaWiki: Remove the MediaWiki app and dependencies. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1066750 [12:27:37] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1065133 (https://phabricator.wikimedia.org/T373174) (owner: 10Gerrit maintenance bot) [12:27:46] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate webperf.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:28:41] !log Starting s6 codfw failover from db2214 to db2129 - T373174 [12:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:45] T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174 [12:29:16] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:29:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T373174', diff saved to https://phabricator.wikimedia.org/P67774 and previous config saved to /var/cache/conftool/dbconfig/20240826-122925-arnaudb.json [12:30:01] (03PS4) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) [12:31:35] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: remove x509ignoreCN=0 from blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066685 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:31:51] (03PS5) 10Jgiannelos: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) [12:32:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Weight db2214 T373174', diff saved to https://phabricator.wikimedia.org/P67775 
and previous config saved to /var/cache/conftool/dbconfig/20240826-123205-arnaudb.json [12:33:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67776 and previous config saved to /var/cache/conftool/dbconfig/20240826-123336-ladsgroup.json [12:34:16] (03PS1) 10Marostegui: mariadb: Add db2232 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1066752 [12:34:28] RESOLVED: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1066753 (https://phabricator.wikimedia.org/T373330) [12:35:36] (03CR) 10Marostegui: [C:03+2] mariadb: Add db2232 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1066752 (owner: 10Marostegui) [12:43:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091725 (10ABran-WMF) preparation job with the first few critical instances on the path is done for now. I'll have a few host to mo... 
[12:43:29] (03PS1) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [12:46:06] jouncebot: now and next [12:46:06] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [12:48:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P67777 and previous config saved to /var/cache/conftool/dbconfig/20240826-124843-ladsgroup.json [12:48:56] (03CR) 10Brouberol: "Note that this requires that the DAGs are injected at runtime, as the stastd client class is `wmf_airflow_common.metrics.custom_statsd_cli" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [12:56:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091780 (10ABran-WMF) this task depends on: T373175 [12:57:11] (03PS2) 10Hnowlan: shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) [12:57:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091785 (10ABran-WMF) [12:57:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091786 (10ABran-WMF) [12:59:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10091794 (10ABran-WMF) [12:59:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw 
servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091795 (10ABran-WMF) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1300). [13:00:05] ihurbain and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] 👋 [13:00:16] o/ [13:00:24] o/ [13:00:46] just fyi: one of my patches is a noop, just adding a script to mediawiki-config and I was a little unsure about process. [13:00:50] i can deploy today [13:01:08] the other will (similar to previous ones) only take effect once it hits prod [13:01:08] i need a deployer today, we have synergies then :D [13:01:21] (03CR) 10Hnowlan: [C:03+2] shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:02:04] hnowlan: scheduling like you did is fine :). although having a +1 from someone would be nice, given your patch adds config. i'm curious about why is it under `scripts` (as opposed to under `rpc`, together with the other one). [13:02:26] (ah, +1s are in history, just not on the latest PS) [13:02:37] urbanecm: good question. this script is explicitly *not* an RPC script, it'll only be invoked via shell [13:03:10] fair enough, that makes sense. let's do it then. [13:03:12] at a later point as part of a Kubernetes Job object [13:03:20] thanks!
[13:03:23] (03PS2) 10Isabelle Hurbain-Palatin: Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) [13:03:26] (03CR) 10Urbanecm: [C:03+2] Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:03:34] (03PS5) 10Hnowlan: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) [13:03:37] (03CR) 10Urbanecm: [C:03+2] scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:03:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67778 and previous config saved to /var/cache/conftool/dbconfig/20240826-130350-ladsgroup.json [13:03:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:03:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:03:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:04:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67779 and previous config saved to /var/cache/conftool/dbconfig/20240826-130401-ladsgroup.json [13:05:01] (03Merged) 10jenkins-bot: shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) (owner: 
10Hnowlan) [13:05:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67780 and previous config saved to /var/cache/conftool/dbconfig/20240826-130510-ladsgroup.json [13:05:46] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:06:08] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:06:27] (03Merged) 10jenkins-bot: Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:06:29] (03Merged) 10jenkins-bot: scripts: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:06:56] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [13:07:22] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [13:08:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:08:58] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:09:31] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]] [13:09:36] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:09:36] T369048: Create maintenance script to execute jobs provided in json format from 
standard input - https://phabricator.wikimedia.org/T369048 [13:10:56] (03PS1) 10Filippo Giunchedi: prometheus: SystemdUnitFailed as warning for data-persitence [puppet] - 10https://gerrit.wikimedia.org/r/1066762 (https://phabricator.wikimedia.org/T357333) [13:17:49] scap is still scapping :/ [13:20:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67781 and previous config saved to /var/cache/conftool/dbconfig/20240826-132016-ladsgroup.json [13:21:04] (03PS1) 10Ayounsi: site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770 [13:22:24] (03CR) 10Cathal Mooney: [C:03+1] site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770 (owner: 10Ayounsi) [13:22:34] (03CR) 10Ayounsi: [C:03+2] site.pp: extend rpki host regex to 9 [puppet] - 10https://gerrit.wikimedia.org/r/1066770 (owner: 10Ayounsi) [13:23:34] (03CR) 10TChin: [C:03+1] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: 10Gmodena) [13:24:00] !log urbanecm@deploy1003 hnowlan, urbanecm, ihurbain: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:04] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:24:05] T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048 [13:24:07] Finally [13:24:08] aha [13:24:11] ihurbain: can you test, please? [13:24:15] yup, doing that [13:24:35] hnowlan: i presume your patch can go ahead right away. unless you want to do sth at mwdebug while it's there? 
[13:24:51] (the script one) [13:26:08] urbanecm: nope, go ahead thanks [13:26:12] will do [13:27:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [13:27:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [13:27:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67782 and previous config saved to /var/cache/conftool/dbconfig/20240826-132738-ladsgroup.json [13:27:42] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:28:00] urbanecm: ship it [13:28:29] !log urbanecm@deploy1003 hnowlan, urbanecm, ihurbain: Continuing with sync [13:28:33] syncing! [13:28:37] woot! [13:29:36] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host rpki2003.codfw.wmnet [13:29:38] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2003.codfw.wmnet [13:30:45] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host rpki2003.codfw.wmnet [13:30:47] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:30:48] urbanecm: hi, would you mind pinging me once you are done with the deployment? thank you! [13:31:00] godog: hello! 
no problem, will do [13:31:09] cheers [13:31:37] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066772 [13:32:06] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066773 [13:34:00] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM rpki2003.codfw.wmnet - ayounsi@cumin1002" [13:34:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM rpki2003.codfw.wmnet - ayounsi@cumin1002" [13:34:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:34:04] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache rpki2003.codfw.wmnet on all recursors [13:34:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) rpki2003.codfw.wmnet on all recursors [13:34:26] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and A:cp for 9.2.5-1wm2 [13:34:36] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM rpki2003.codfw.wmnet - ayounsi@cumin1002" [13:34:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM rpki2003.codfw.wmnet - ayounsi@cumin1002" [13:35:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P67783 and previous config saved to /var/cache/conftool/dbconfig/20240826-133524-ladsgroup.json [13:35:34] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host rpki2003.codfw.wmnet with OS bookworm 
[13:36:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10091977 (10Clement_Goubert) 05Open→03In progress [13:36:25] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064795|Rollout Parsoid Kartographer support on all wikis (T342871)]], [[gerrit:1059394|scripts: add script for running jobs from stdin rather than http (T369048)]] (duration: 26m 53s) [13:36:28] finally [13:36:29] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:36:29] T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048 [13:36:32] (03PS3) 10Hnowlan: use shellbox-video globally (adding group2, including commons) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) [13:36:35] (03CR) 10Urbanecm: [C:03+2] use shellbox-video globally (adding group2, including commons) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:36:37] thank you urbanecm ! 
[13:36:39] now the last one :) [13:36:43] no problem ihurbain [13:36:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:37:22] (03Merged) 10jenkins-bot: use shellbox-video globally (adding group2, including commons) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064390 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:37:33] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] [13:37:37] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:40:19] !log urbanecm@deploy1003 hnowlan, urbanecm: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:29] hnowlan: can you test, please? [13:40:57] urbanecm: there's no testing possible on the test servers unfortunately, this needs to go to the prod jobrunners [13:41:09] ah, i see. so, we need to go ahead and see? 
[13:41:14] yep, afraid so :D [13:41:17] !log urbanecm@deploy1003 hnowlan, urbanecm: Continuing with sync [13:41:20] let's see then :D [13:41:21] this is reasonably well understood, only concern is capacity [13:41:34] famous last words, on both accounts [13:41:45] * urbanecm notes to mass-upload tons of videos shortly after finishing the window [13:42:01] 😅 [13:45:35] !log Started 6hr maximum scan on nowiki - https://wikitech.wikimedia.org/wiki/MediaModeration [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:38] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] (duration: 08m 04s) [13:45:41] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:45:48] hnowlan: well, it's out :) [13:45:51] anything else? [13:46:24] urbanecm: that's all for me, thank you! [13:46:29] no problem! [13:46:41] godog: i'm done. not sure if hnowlan wants a while to monitor the impact of the last change. [13:47:41] urbanecm: thank you! 
appreciate it, I'll check the traffic here and proceed in case [13:49:10] (03PS1) 10Daimona Eaytoy: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) [13:49:11] on my end for now I think it's just a matter of observation and possibly prayer [13:49:14] (03CR) 10Clément Goubert: "See inline, this would only apply to bare-metal and not mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1049625 (https://phabricator.wikimedia.org/T356814) (owner: 10Cwhite) [13:50:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T370903)', diff saved to https://phabricator.wikimedia.org/P67784 and previous config saved to /var/cache/conftool/dbconfig/20240826-135031-ladsgroup.json [13:50:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:50:35] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:50:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:50:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67785 and previous config saved to /var/cache/conftool/dbconfig/20240826-135052-ladsgroup.json [13:50:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy) [13:51:03] (03CR) 10Gmodena: [C:03+2] data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 
(https://phabricator.wikimedia.org/T372768) (owner: 10Gmodena) [13:51:20] jouncebot: nowandnext [13:51:20] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1300) [13:51:20] In 1 hour(s) and 38 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530) [13:51:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2013.codfw.wmnet [13:52:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2013.codfw.wmnet [13:52:51] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on rpki2003.codfw.wmnet with reason: host reimage [13:52:53] (03Merged) 10jenkins-bot: data-engineering: refactor MediawikiPageContentChangeEnrichAvailability [alerts] - 10https://gerrit.wikimedia.org/r/1064345 (https://phabricator.wikimedia.org/T372768) (owner: 10Gmodena) [13:53:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67786 and previous config saved to /var/cache/conftool/dbconfig/20240826-135301-ladsgroup.json [13:53:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bullseye [13:53:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[13:53:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [13:53:42] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:55:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rpki2003.codfw.wmnet with reason: host reimage [13:56:10] ok thank you, I'll proceed with prometheus esams bookworm upgrade [13:56:59] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2013 - cgoubert@cumin1002" [13:59:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2013 - cgoubert@cumin1002" [13:59:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:59:04] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2013.codfw.wmnet 68.0.192.10.in-addr.arpa 8.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:59:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2013.codfw.wmnet 68.0.192.10.in-addr.arpa 8.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:59:08] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013 [13:59:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013 [13:59:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:00:29] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2004.codfw.wmnet [14:00:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2004.codfw.wmnet 
[14:00:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2004.codfw.wmnet [14:01:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2004.codfw.wmnet [14:02:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2004.codfw.wmnet with OS bullseye [14:02:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092094 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [14:03:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [14:03:22] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:06:23] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2004 - cgoubert@cumin1002" [14:06:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2004 - cgoubert@cumin1002" [14:06:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:28] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2004.codfw.wmnet 178.16.192.10.in-addr.arpa 8.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:06:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2004.codfw.wmnet 178.16.192.10.in-addr.arpa 8.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:06:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2034.codfw.wmnet [14:06:31] !log 
cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2004 [14:06:40] !log start prometheus3003 bookworm upgrade - T326657 [14:06:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2004 [14:06:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:07:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2034.codfw.wmnet [14:07:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2034.codfw.wmnet with OS bullseye [14:08:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67787 and previous config saved to /var/cache/conftool/dbconfig/20240826-140808-ladsgroup.json [14:08:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092103 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [14:08:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [14:10:14] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092109 (10ssingh) @KCVelaga_WMF: https://phabricator.wikimedia.org/legalpad/signatures/3/query/mfpOg6TDIwDU/#R indicates that you have signed an older version of L3. Ca... 
[14:10:59] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:11:01] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092111 (10ssingh) [14:12:14] (03PS1) 10Jelto: profile::firewall::nftables_throttling: fix issue of global metering [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) [14:14:15] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2034 - cgoubert@cumin1002" [14:14:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2034 - cgoubert@cumin1002" [14:14:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:14:19] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2034.codfw.wmnet 57.0.192.10.in-addr.arpa 7.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:14:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2034.codfw.wmnet 57.0.192.10.in-addr.arpa 7.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:14:23] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2034 [14:14:34] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [14:14:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2034 [14:14:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:15:29] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker2008.codfw.wmnet [14:15:36] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3742/co" [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [14:16:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2008.codfw.wmnet [14:16:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [14:16:44] (03PS1) 10David Caro: alerts: add toolsadmin probe [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) [14:16:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2008.codfw.wmnet with OS bullseye [14:17:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092168 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [14:17:02] (03CR) 10Jelto: [V:03+1] "more context in https://phabricator.wikimedia.org/T365259#10092085 and https://wiki.nftables.org/wiki-nftables/index.php/Meters" [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [14:17:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [14:17:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:17:52] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10092176 (10ssingh) @KFrancis: Hi! Checking the spreadsheet, it seems like we will need an NDA for @NCreasy. Thanks as always. 
[14:19:03] (03CR) 10Filippo Giunchedi: [C:03+1] "Can be merged at any time" [puppet] - 10https://gerrit.wikimedia.org/r/1064820 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:19:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [14:19:42] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092189 (10ssingh) [14:20:36] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [14:20:41] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2008 - cgoubert@cumin1002" [14:20:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2008 - cgoubert@cumin1002" [14:20:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:45] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2008.codfw.wmnet 196.16.192.10.in-addr.arpa 6.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:20:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2008.codfw.wmnet 196.16.192.10.in-addr.arpa 6.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:20:49] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2008 [14:21:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2008 [14:21:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:21:10] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092195 (10ssingh) [14:21:53] !log Running homer 'cr*codfw*' commit 'T372878' [14:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:57] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:22:21] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10092203 (10ssingh) [14:23:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P67788 and previous config saved to /var/cache/conftool/dbconfig/20240826-142315-ladsgroup.json [14:23:20] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3744/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [14:23:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [14:24:28] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:35] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Email: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#10092204 (10jhathaway) 05Open→03Resolved a:03jhathaway @Xover, I am 
going to assume this is no longer occurring, please reopen, if it occurs... [14:24:47] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10092202 (10ssingh) @jhathaway: Can you please confirm from I/F side as part... [14:25:36] (03CR) 10Hashar: [C:04-1] "This will get Puppet to install OpenJDK 17 on the hosts however:" [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:25:46] (03CR) 10Filippo Giunchedi: "With the latest PSes in place https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064820 no longer changes 'alertmanagers' which I think" [puppet] - 10https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:26:03] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066785 [14:26:10] (03CR) 10Filippo Giunchedi: "LGTM when the time comes" [dns] - 10https://gerrit.wikimedia.org/r/1065258 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:26:43] (03CR) 10Ebernhardson: search: use mul fallback for manually-tuned search profiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [14:26:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [14:27:14] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066786 [14:27:47] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Juniper alarms (instance cr1-eqiad) - https://phabricator.wikimedia.org/T373166#10092233 (10ayounsi) →14Duplicate dup:03T372781 [14:27:58] 10ops-eqiad, 06DC-Ops, 
06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10092235 (10ayounsi) [14:30:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage [14:31:03] (03CR) 10Jelto: [V:03+1 C:03+1] "looks mostly good, one comment about the metric names in line" [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [14:31:28] (03CR) 10Ebernhardson: "Private wikis are also now running in SUP, the cirrus load on the job queue still remains for some small use cases (wikitech, hopefully be" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 (owner: 10DCausse) [14:32:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage [14:34:28] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:36:35] !log Started 6hr maximum scan on group2 - https://wikitech.wikimedia.org/wiki/MediaModeration [14:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [14:38:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T370903)', diff saved to 
https://phabricator.wikimedia.org/P67789 and previous config saved to /var/cache/conftool/dbconfig/20240826-143822-ladsgroup.json [14:38:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:38:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:38:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:38:42] (03CR) 10Jgiannelos: "I updated the patch with both the IPs and the hostnames per node." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos) [14:38:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67790 and previous config saved to /var/cache/conftool/dbconfig/20240826-143844-ladsgroup.json [14:39:28] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bullseye [14:39:48] (03CR) 10Ebernhardson: [C:04-1] "private wikis are now supported in SUP, the only remaining wiki is wikitech. 
Progress is underway in T292707 to bring wikitech into kubern" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 (owner: 10DCausse) [14:39:52] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092313 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [14:40:22] !log homer 'lsw1-a5-codfw*' commit 'T372878' [14:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:25] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:40:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [14:41:33] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2013.codfw.wmnet [14:41:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2013.codfw.wmnet [14:41:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67791 and previous config saved to /var/cache/conftool/dbconfig/20240826-144153-ladsgroup.json [14:41:56] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2014.codfw.wmnet [14:42:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2014.codfw.wmnet [14:43:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:44:08] 06SRE, 10SRE-Access-Requests: Requesting access to 
airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092327 (10KCVelaga_WMF) @ssingh When I visit L3, it shows `You signed this document on Oct 11 2021, 6:29 PM.` I don't have any option to un-sign the older version and s... [14:44:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bullseye [14:44:38] 07Puppet, 06Infrastructure-Foundations, 06Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980#10092329 (10joanna_borun) p:05Triage→03Low [14:44:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092328 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [14:44:45] !log dancy@deploy1003 Installing scap version "4.100.0" for 211 hosts [14:44:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [14:45:06] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:45:29] !log dancy@deploy1003 Installation of scap version "4.100.0" completed for 211 hosts [14:46:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2004.codfw.wmnet with OS bullseye [14:46:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092348 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [14:46:38] (03CR) 10Ahmon Dancy: [C:03+1] "Scap 4.100.0 (which uses this setting) has been deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy) [14:47:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:47:13] !log homer 'lsw1-b3-codfw*' commit T372878 [14:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:17] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:48:35] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 07Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#10092353 (10joanna_borun) p:05High→03Medium [14:49:16] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2014 - cgoubert@cumin1002" [14:49:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2014 - cgoubert@cumin1002" [14:49:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2014.codfw.wmnet 70.0.192.10.in-addr.arpa 0.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2014.codfw.wmnet 70.0.192.10.in-addr.arpa 0.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:24] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2014 [14:49:36] !log 
cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2004.codfw.wmnet [14:49:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2004.codfw.wmnet [14:49:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2014 [14:49:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [14:50:42] !log Running homer 'cr*codfw*' commit 'T372878' [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:35] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10092360 (10elukey) p:05High→03Medium Left to do: * Make sure the new conftool package is deployed on all puppetserver no... [14:53:58] 06SRE, 06Infrastructure-Foundations, 10Mail: exim should log the reason for defer with disconnect after HELO/EHLO - https://phabricator.wikimedia.org/T265142#10092368 (10jhathaway) 05Open→03Declined We have moved to Postfix for ingress and egress, so declining. [14:54:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2034.codfw.wmnet with OS bullseye [14:54:19] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[14:54:48] !log homer 'lsw-a3-codfw*' commit T372878 [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:55:04] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10092374 (10joanna_borun) p:05Medium→03Low [14:55:12] !log homer 'lsw1-a3-codfw*' commit T372878 [14:55:14] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603#10092376 (10jhathaway) @Base is this issue still ongoing? [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:00] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: CAS Single Logout Flow - https://phabricator.wikimedia.org/T233941#10092382 (10SLyngshede-WMF) a:03SLyngshede-WMF [14:56:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2034.codfw.wmnet [14:56:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2034.codfw.wmnet [14:57:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67792 and previous config saved to /var/cache/conftool/dbconfig/20240826-145700-ladsgroup.json [14:57:16] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942#10092389 (10SLyngshede-WMF) p:05Medium→03Low a:03SLyngshede-WMF [14:57:21] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092395 (10ssingh) >>! 
In T373194#10092327, @KCVelaga_WMF wrote: > @ssingh When I visit L3, it shows `You signed this document on Oct 11 2021, 6:29 PM.` I don't have any... [14:58:48] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Observability-Alerting: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016#10092392 (10jhathaway) 05Open→03Declined Since we have migrated to Postfix, and Postfix doesn't have a panic log, declining. [14:58:50] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10092397 (10elukey) [14:59:47] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092402 (10ssingh) [15:00:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2008.codfw.wmnet with OS bullseye [15:00:24] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092408 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [15:00:45] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:54] (03CR) 10Dzahn: [V:03+1] "Ack, Does the fact that a follow-up change is needed make this a -1 though?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [15:01:59] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092417 (10ssingh) For posterity: approving manager and actual manager are the same in this case. [15:02:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rpki2003.codfw.wmnet with OS bookworm [15:02:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki2003.codfw.wmnet [15:02:23] !log homer 'lsw1-b6-codfw*' commit T372878 [15:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:26] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:03:27] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2008.codfw.wmnet [15:03:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2008.codfw.wmnet [15:04:04] 06SRE, 06Infrastructure-Foundations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243#10092422 (10joanna_borun) 05Open→03Declined [15:04:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:39] 10SRE-tools, 06Infrastructure-Foundations: Better detection for "reboot into PXE failed" conditions in wmf-auto-reimage - https://phabricator.wikimedia.org/T261956#10092436 (10joanna_borun) 05Open→03Declined [15:06:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: 
host reimage [15:08:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage [15:08:52] (03PS1) 10Andrew Bogott: openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 [15:09:17] (03CR) 10CI reject: [V:04-1] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott) [15:10:42] (03PS2) 10Andrew Bogott: openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 [15:11:13] 10SRE-tools, 10Cloud-VPS, 06Infrastructure-Foundations: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788#10092464 (10SLyngshede-WMF) a:03SLyngshede-WMF [15:12:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P67793 and previous config saved to /var/cache/conftool/dbconfig/20240826-151207-ladsgroup.json [15:14:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10092513 (10ayounsi) a:03ayounsi [15:15:17] 06SRE, 10SRE-tools, 10Icinga, 06Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447#10092518 (10joanna_borun) 05Open→03Resolved [15:17:04] 06SRE, 06Infrastructure-Foundations: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524#10092537 (10joanna_borun) 05Open→03Declined [15:22:20] 06SRE, 06Infrastructure-Foundations: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989#10092583 (10jhathaway) 05Open→03Resolved a:03jhathaway We assume you are now using debian's package, please re-open if something else is needed. 
[15:23:25] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott) [15:27:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T370903)', diff saved to https://phabricator.wikimedia.org/P67794 and previous config saved to /var/cache/conftool/dbconfig/20240826-152715-ladsgroup.json [15:27:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:27:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:27:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:28:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bullseye [15:28:39] !log homer 'lsw1-a5-codfw*' commit 'T372878' [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:42] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:29:05] (03CR) 10David Caro: [C:03+1] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott) [15:29:16] (03PS1) 10Ayounsi: Ganeti test/routed: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) [15:29:39] (03CR) 10Andrew Bogott: [C:03+2] openstack/prometheus: remove openstack-exporter.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1066793 (owner: 10Andrew Bogott) [15:29:44] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet [15:29:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for 
host wikikube-worker2014.codfw.wmnet [15:29:52] jouncebot: next [15:29:52] In 0 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530) [15:29:52] (03PS2) 10Ayounsi: Ganeti test/routed: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) [15:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530) [15:30:32] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [15:32:47] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[15:33:50] (03CR) 10Herron: [C:03+1] udp2log: tag logrotated mwlogs with yesterdays date [puppet] - 10https://gerrit.wikimedia.org/r/984228 (https://phabricator.wikimedia.org/T353221) (owner: 10Cwhite) [15:33:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:34:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:34:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67795 and previous config saved to /var/cache/conftool/dbconfig/20240826-153415-ladsgroup.json [15:34:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:34:47] (03CR) 10Herron: [C:03+1] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:35:46] (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1063233 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:36:20] (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:36:35] (03CR) 10Herron: [C:03+1] alert: Update alertmanager tests hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:36:40] (03PS1) 10Ssingh: admin: add kcvelaga to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1066803 (https://phabricator.wikimedia.org/T373194) [15:37:15] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 
(https://phabricator.wikimedia.org/T128546) [15:37:50] !log starting Wikimedia Portals Update. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1066804 [15:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:38] (03CR) 10Ssingh: [C:03+2] admin: add kcvelaga to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1066803 (https://phabricator.wikimedia.org/T373194) (owner: 10Ssingh) [15:40:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67796 and previous config saved to /var/cache/conftool/dbconfig/20240826-154000-ladsgroup.json [15:40:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:40:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67797 and previous config saved to /var/cache/conftool/dbconfig/20240826-154024-ladsgroup.json [15:40:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:41:12] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:41:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066804 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:42:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to airflow-analytics-product-admins for kcvelaga - https://phabricator.wikimedia.org/T373194#10092760 (10ssingh) 05Open→03Resolved a:03ssingh Request merged, please try in ~30 mins and if it doesn't work, please re-open this task.... 
[15:42:43] (03CR) 10Andrew Bogott: [C:03+1] "Seems good although it would be nice to tie to the wmcs team (which I think is not yet possible)" [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: 10David Caro) [15:43:12] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Enable require_tty_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1065271 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy) [15:43:20] jouncebot: now [15:43:20] For the next 0 hour(s) and 16 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T1530) [15:44:56] 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10092785 (10Gehel) p:05Triage→03High [15:47:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-eqsin and A:cp for 9.2.5-1wm2 [15:47:44] !log finished upgrading A:cp-eqsin to ATS 9.2.5: T339134 [15:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:47] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [15:52:36] (03PS3) 10Dzahn: prometheus: create text file export for nft throttling denylist length [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) [15:53:10] (03CR) 10Dzahn: prometheus: create text file export for nft throttling denylist length (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [15:55:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67798 and previous config saved to /var/cache/conftool/dbconfig/20240826-155507-ladsgroup.json [15:55:20] !log jdrewniak@deploy1003 Synchronized 
portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 09m 39s) [15:55:31] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:55:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67799 and previous config saved to /var/cache/conftool/dbconfig/20240826-155531-ladsgroup.json [15:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [15:56:40] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2001.codfw.wmnet [15:57:08] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2035.codfw.wmnet [15:57:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2001.codfw.wmnet [15:57:35] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 02m 14s) [15:57:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2035.codfw.wmnet [16:01:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [16:01:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[16:01:41] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2001.codfw.wmnet with OS bullseye [16:01:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [16:02:13] (03PS1) 10Andrew Bogott: Openstack policies: open up some more read-only endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1066808 [16:02:13] (03PS1) 10Andrew Bogott: prometheus-openstack-exporter: Use novaobserver rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1066809 [16:03:59] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2001.codfw.wmnet [16:03:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2001.codfw.wmnet [16:04:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2035.codfw.wmnet with OS bullseye [16:05:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[16:05:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [16:06:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [16:07:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10092887 (10ssingh) a:05Fabfur→03None [16:10:01] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2035 - cgoubert@cumin1002" [16:10:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2035 - cgoubert@cumin1002" [16:10:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:06] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2035.codfw.wmnet 62.16.192.10.in-addr.arpa 2.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:10:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2035.codfw.wmnet 62.16.192.10.in-addr.arpa 2.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:10:10] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2035 [16:10:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67800 and previous config saved to /var/cache/conftool/dbconfig/20240826-161015-ladsgroup.json [16:10:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2035 [16:10:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [16:10:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P67801 and previous config saved to /var/cache/conftool/dbconfig/20240826-161039-ladsgroup.json [16:10:41] (03PS2) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) [16:11:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [16:13:41] !log dancy@deploy1003 Started scap sync-world: testing [16:13:52] !log dancy@deploy1003 Stopping before sync operations [16:16:40] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Enable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066810 (https://phabricator.wikimedia.org/T361724) [16:20:10] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Enable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066810 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy) [16:25:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T371742)', diff saved to https://phabricator.wikimedia.org/P67802 and previous config saved to /var/cache/conftool/dbconfig/20240826-162522-ladsgroup.json [16:25:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:25:35] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:25:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:25:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P67803 and previous config saved to /var/cache/conftool/dbconfig/20240826-162544-ladsgroup.json [16:25:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T370903)', diff saved to https://phabricator.wikimedia.org/P67804 and previous config saved to /var/cache/conftool/dbconfig/20240826-162553-ladsgroup.json [16:25:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:25:58] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:26:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:28:35] !log homer 'cr*codfw*' commit 'T372878' [16:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:29:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage [16:29:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:30:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:30:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:30:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:30:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67805 and previous config saved to 
/var/cache/conftool/dbconfig/20240826-163032-ladsgroup.json [16:32:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage [16:35:48] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093010 (10Mstyles) Hey! I'm from the security team and I didn't see either... [16:37:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10093030 (10Jhancock.wm) 05Open→03Resolved [16:37:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67806 and previous config saved to /var/cache/conftool/dbconfig/20240826-163728-ladsgroup.json [16:37:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:41:12] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003 - https://phabricator.wikimedia.org/T373149#10093041 (10Dwisehaupt) [16:41:51] (03CR) 10Andrew Bogott: [C:03+2] Openstack policies: open up some more read-only endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1066808 (owner: 10Andrew Bogott) [16:46:23] (03CR) 10Slyngshede: [C:03+1] "LGTM, let's try it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [16:47:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10093051 (10VRiley-WMF) ml-serve1009 Rack A2 U19 CableID 4897 Port 7 ml-serve1010 Rack E5 U3... [16:51:41] (03CR) 10Anzx: "namespace also needed to be updated on `wgMetaNamespace` in `wmf-config/core-Namespaces.php`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [16:52:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67807 and previous config saved to /var/cache/conftool/dbconfig/20240826-165235-ladsgroup.json [16:52:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2035.codfw.wmnet with OS bullseye [16:52:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[16:52:56] (03CR) 10Slyngshede: [C:03+1] "The node_exporter pages has a comment that says: "To version 8.4", so this might not work on Bookworm which ships with drdb-utils version " [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [16:53:24] !log homer 'lsw1-b8-codfw*' commit T372878 [16:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:28] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:53:32] (03CR) 10Andrew Bogott: "A few cinder endpoints explicitly check the admin flag and fail after this change. It still feels better to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1066809 (owner: 10Andrew Bogott) [16:53:40] (03CR) 10Andrew Bogott: [C:03+2] prometheus-openstack-exporter: Use novaobserver rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1066809 (owner: 10Andrew Bogott) [16:54:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2035.codfw.wmnet [16:54:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2035.codfw.wmnet [16:55:30] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:10] (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:00:30] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:53] (03Abandoned) 10Ryan Kemper: wdqs: create wdqs split pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) (owner: 10Stevemunene) [17:04:12] (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:07:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67808 and previous config saved to /var/cache/conftool/dbconfig/20240826-170742-ladsgroup.json [17:07:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10093179 (10VRiley-WMF) [17:08:03] (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:11:15] (03CR) 10Dzahn: "Apparently this changed meant that now we can't change the JAVA version without coordinating changes in both puppet repo and deployment re" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [17:13:32] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093216 (10Aklapper) > Perhaps they've already been removed by someone else?... [17:13:50] (03CR) 10Dzahn: [C:03+2] "based on your previous +1, and since I addressed the comment, i'll go ahead. 
open to further fixes of course" [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [17:16:44] (03PS2) 10Ryan Kemper: wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368) [17:16:44] (03PS3) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [17:16:44] (03PS4) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [17:16:45] (03PS4) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [17:18:25] (03PS1) 10Ryan Kemper: Revert^2 "wdqs graph split: routing for wdqs backends" [puppet] - 10https://gerrit.wikimedia.org/r/1066812 [17:19:13] (03PS2) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [17:19:30] Dreamy_Jazz: does the concept of global groups make sense for temporary accounts? 
[17:21:08] (03PS3) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [17:22:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T370903)', diff saved to https://phabricator.wikimedia.org/P67809 and previous config saved to /var/cache/conftool/dbconfig/20240826-172250-ladsgroup.json [17:22:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:22:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:23:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:23:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on 11 hosts with reason: Maintenance [17:23:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 11 hosts with reason: Maintenance [17:27:57] (03PS10) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) [17:28:06] (03PS5) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) [17:32:20] (03CR) 10Scott French: [C:03+2] eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [17:33:36] (03Merged) 10jenkins-bot: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [17:37:02] (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: new A, 
PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [17:39:03] !log T364364 Created PTR & A records for new graph split services `wdqs-main` and `wdqs-scholarly` (merged https://gerrit.wikimedia.org/r/c/operations/dns/+/1051446 and ran `sudo authdns-update` on `dns1004.wikimedia.org`) [17:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:06] T364364: Provision DNS and certificates for wdqs graph split domains - https://phabricator.wikimedia.org/T364364 [17:39:49] (03CR) 10Ryan Kemper: [C:03+2] wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [17:39:55] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [17:39:59] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2018.codfw.wmnet [17:40:29] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [17:40:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2018.codfw.wmnet [17:41:32] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [17:41:44] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [17:43:18] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-scholarly [17:43:27] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-main [17:50:47] (03PS1) 10Kamila Součková: kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) [17:51:47] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [17:52:04] (03CR) 10Kamila 
Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [17:52:34] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [17:52:46] (03PS4) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [17:52:46] (03PS5) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [17:52:46] (03PS5) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [17:52:47] (03PS4) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [17:52:53] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [17:53:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [17:53:37] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [17:55:22] 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379 (10jwang) 03NEW [17:59:50] 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093472 (10Dzahn) Hi @jwang the request from the past you are referencing is for a different type of access. That was for the " Requested group membership: analytics-privatedata-users, researchers... [18:00:09] (03CR) 10Ssingh: "A:lvs-low-traffic-eqiad or A:lvs-low-traffic-codfw." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [18:02:00] (03PS5) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [18:02:04] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [18:02:39] 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093482 (10Dzahn) Ideally if you could use this form for access requests please: https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ Since you already made this one you can also just copy... [18:02:49] (03PS6) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [18:02:49] (03PS6) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [18:02:49] (03PS6) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [18:02:50] (03PS5) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [18:05:37] 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093490 (10ssingh) Thanks @Dzahn! @jwang: Happy to take care of the request once you file the task and the approvals are in. 
[18:08:34] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [18:09:35] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [18:09:46] (03PS7) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [18:09:54] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:11:26] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:14:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [18:14:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [18:14:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67810 and previous config saved to /var/cache/conftool/dbconfig/20240826-181414-ladsgroup.json [18:14:18] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:16:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67811 and previous config saved to /var/cache/conftool/dbconfig/20240826-181624-ladsgroup.json [18:25:59] (03CR) 10Andrea Denisse: [C:03+2] alert: Add the alert[12]002 hosts as Icinga and AM partners [puppet] - 10https://gerrit.wikimedia.org/r/1064820 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [18:31:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67812 and previous config saved to /var/cache/conftool/dbconfig/20240826-183131-ladsgroup.json [18:32:20] (03CR) 
10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [18:33:08] (03CR) 10Scott French: [C:03+1] "LGTM! Not sure what a great `Hosts` selector might be for something like this ... maybe the worker and control-plane roles? (mainly as a r" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [18:35:12] 06SRE, 10SRE-Access-Requests: Failed to ssh deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T373379#10093604 (10jwang) [18:35:52] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:36:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:38] what happened to puppetmaster [18:36:51] weird, it looks fine [18:37:26] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Disable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066821 (https://phabricator.wikimedia.org/T361724) [18:38:01] alerting for a few days apparently? 
[18:39:43] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) (owner: 10Ryan Kemper) [18:40:24] I am thinking now this is a stale alert [18:43:02] (03PS2) 10Bartosz Dziewoński: Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 [18:43:17] (03CR) 10Bartosz Dziewoński: "Thanks for the reply. Looks like the issue that's annoying me is still there on idm-test. Here's a new version of this patch that should f" [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [18:44:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T371742)', diff saved to https://phabricator.wikimedia.org/P67813 and previous config saved to /var/cache/conftool/dbconfig/20240826-184441-ladsgroup.json [18:44:46] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:45:21] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Disable require_terminal_multiplexer [puppet] - 10https://gerrit.wikimedia.org/r/1066821 (https://phabricator.wikimedia.org/T361724) (owner: 10Ahmon Dancy) [18:46:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67814 and previous config saved to /var/cache/conftool/dbconfig/20240826-184638-ladsgroup.json [18:47:33] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:49:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:50:03] 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10093667 (10jwang) [18:50:48] !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1023* [18:51:14] (03PS7) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [18:56:52] (03PS8) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [18:56:52] (03PS7) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [18:56:52] (03PS6) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [18:56:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10093700 (10jwang) SSH public key {F57295594} [18:59:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67815 and previous config saved to /var/cache/conftool/dbconfig/20240826-185948-ladsgroup.json [19:01:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T370903)', diff saved to https://phabricator.wikimedia.org/P67816 and previous config saved to /var/cache/conftool/dbconfig/20240826-190145-ladsgroup.json [19:01:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [19:01:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [19:01:51] T370903: 
Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:01:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:01:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:02:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67817 and previous config saved to /var/cache/conftool/dbconfig/20240826-190201-ladsgroup.json [19:03:00] (03PS2) 10Kamila Součková: kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) [19:03:15] (03CR) 10Kamila Součková: "I'm not sure either, going to just leave it out I guess. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [19:03:37] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [19:03:56] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename + re-IP kubernetes2018 [puppet] - 10https://gerrit.wikimedia.org/r/1066814 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [19:04:11] 06SRE, 10SRE-Access-Requests: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10093710 (10ssingh) @thcipriani: this requires your approval, thanks! @mpopov (approving manager) already added. 
[19:04:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67818 and previous config saved to /var/cache/conftool/dbconfig/20240826-190411-ladsgroup.json [19:05:16] (03PS1) 10Ssingh: admin: add jiawang to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1066833 (https://phabricator.wikimedia.org/T373379) [19:05:42] kamila_: looks like we merged patches at the same time, I went ahead and puppet-merged both of ours jfyi [19:05:51] !log T280001 Disabled puppet on all lvs hosts in preparation for rolling restart [19:05:54] thanks ryankemper [19:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:54] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [19:05:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10093735 (10ssingh) [19:06:47] !log T280001 [eqiad] enabled puppet on eqiad lvs hosts, expecting alerts soon [19:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:11] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2018 to wikikube-worker2041 [19:07:28] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:07:33] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:08:07] (03PS1) 10Dzahn: gerrit/prometheus: create profile for new nft throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) [19:11:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2018 to wikikube-worker2041 - 
kamila@cumin1002" [19:11:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2018 to wikikube-worker2041 - kamila@cumin1002" [19:11:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:57] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2041 [19:12:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2041 [19:12:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2018 to wikikube-worker2041 [19:13:13] !log T280001 [eqiad] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'` [19:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:17] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [19:13:18] !log T280001 [eqiad] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding [19:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20... 
[19:14:18] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2041.codfw.wmnet with OS bullseye [19:14:28] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host [19:14:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [19:14:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67819 and previous config saved to /var/cache/conftool/dbconfig/20240826-191456-ladsgroup.json [19:15:44] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:16:09] !log T280001 [eqiad] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'` [19:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:14] !log T280001 [eqiad] `sudo ipvsadm -L -n` on lvs primary looks good, proceeding [19:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67820 and previous config saved to /var/cache/conftool/dbconfig/20240826-191917-ladsgroup.json [19:20:19] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2041 - kamila@cumin1002" [19:20:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2041 - kamila@cumin1002" [19:20:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:24] 
!log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2041.codfw.wmnet 125.0.192.10.in-addr.arpa 5.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:20:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2041.codfw.wmnet 125.0.192.10.in-addr.arpa 5.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:20:28] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2041 [19:20:34] !log T280001 [codfw] ran puppet on codfw lvs hosts, expecting alerts soon [19:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:37] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [19:20:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2041 [19:20:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [19:20:56] !log T280001 [codfw] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'` [19:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:20] icinga-wm: help [19:23:43] !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding [19:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:46] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [19:24:05] !log T364368 [codfw] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'` [19:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:42] !log sukhe@alert1001:~$ sudo systemctl restart ircecho.service [19:24:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:00] icinga-wm: help [19:25:21] alerts on Icinga but not here [19:25:21] hmm [19:25:23] !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts [19:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:36] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [19:27:40] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:27:44] ah ok it's back [19:27:46] (03PS8) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [19:27:50] mutante: ^ back [19:27:55] (03PS7) 10Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) [19:28:14] sukhe: I tried to send "custom notification" but as always logged in with the wrong user :) [19:28:46] :) [19:30:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T371742)', diff saved to https://phabricator.wikimedia.org/P67821 and previous config saved to /var/cache/conftool/dbconfig/20240826-193003-ladsgroup.json [19:30:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [19:30:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:30:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [19:30:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:30:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:30:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T371742)', diff saved to https://phabricator.wikimedia.org/P67822 and previous config saved to /var/cache/conftool/dbconfig/20240826-193032-ladsgroup.json [19:31:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:31:36] ohhh [19:34:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67823 and previous config saved to /var/cache/conftool/dbconfig/20240826-193425-ladsgroup.json [19:36:04] (03CR) 10Ryan Kemper: [C:03+2] wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [19:36:23] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage [19:36:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:09] (03PS2) 10Dzahn: gerrit/prometheus: create profile for new nft 
throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) [19:39:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:39:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage [19:40:09] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1066834/3747/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [19:41:02] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093910 (10Mstyles) @Aklapper perhaps they never had security access to begi... [19:42:27] !log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts [19:42:29] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093917 (10ssingh) I don't see a history of them having being added either.... 
[19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:32] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [19:42:35] oops, wrong log message [19:43:15] !log T364368 Merged patch to move lvs state to `production` for `wdqs-main` and `wdqs-scholarly` (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064848) and ran puppet on all LVS hosts [19:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:43:54] (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:44:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:44:43] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093923 (10jhathaway) >>! In T372767#10092202, @ssingh wrote: > @jhathaway:... 
[19:45:09] !log T364368 Merged patch to add dns discovery resources for `wdqs-main` and `wdqs-scholarly` (https://gerrit.wikimedia.org/r/c/operations/dns/+/1064831), and ran puppet on all DNS hosts [19:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:31] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093930 (10ssingh) [19:47:56] FIRING: [4x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:48:18] !log T364368 Manually adding dns discovery resources to etcd corresponding to https://wikitech.wikimedia.org/wiki/LVS#Add_the_DNS_Discovery_Record [19:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:22] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [19:48:28] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-main [19:48:32] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly [19:49:02] (03CR) 10Ryan Kemper: [C:03+2] wdqs graph split: routing for wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1066812 (https://phabricator.wikimedia.org/T364367) (owner: 10Ryan Kemper) [19:49:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67824 and previous config saved to /var/cache/conftool/dbconfig/20240826-194933-ladsgroup.json [19:49:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance 
[19:49:37] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:49:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [19:49:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67825 and previous config saved to /var/cache/conftool/dbconfig/20240826-194944-ladsgroup.json [19:50:23] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10093931 (10ssingh) 05Open→03Resolved a:03ssingh [19:51:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:51:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10093947 (10ssingh) Hi @ngkountas: It seems like the new key uploaded is also the same one that was being used in WMCS. Please generate a new key independent of WMCS an... 
[19:51:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67826 and previous config saved to /var/cache/conftool/dbconfig/20240826-195153-ladsgroup.json [19:52:49] ryankemper: [19:52:55] Aug 26 19:51:26 dns1004 confd[1821905]: 2024-08-26T19:51:26Z dns1004 /usr/bin/confd[1821905]: ERROR 100: Key not found (/conftool/v1/discovery/wdqs-main) [3293405] [19:52:56] FIRING: [18x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:52:58] Aug 26 19:51:26 dns1004 confd[1821905]: 2024-08-26T19:51:26Z dns1004 /usr/bin/confd[1821905]: ERROR 100: Key not found (/conftool/v1/discovery/wdqs-scholarly) [3293405] [19:53:21] you definitely had a patch for adding the new services to conftool-data/discovery/services.yaml [19:53:24] that should be merged [19:54:15] +wdqs-main: [eqiad, codfw] [19:54:15] +wdqs-scholarly: [eqiad, codfw] [19:54:26] like this [19:55:27] sukhe: oh, thanks for catching that. looks like we forgot to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064479 [19:55:43] (03CR) 10Ryan Kemper: [C:03+2] wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [19:55:44] (03CR) 10Ssingh: [C:03+1] wdqs: new -main, -scholarly services [puppet] - 10https://gerrit.wikimedia.org/r/1064479 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [19:55:49] indeed! [19:56:27] sukhe: once puppet has run on dns servers should that resolve the alert? 
or do we have to do any special steps to unstick things [19:56:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:57:06] ryankemper: so after merging on master, we should just check that the keys in etcd have been created [19:57:11] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [19:57:17] and the cleanup will involve removing some stale error files (but you can leave that to me) [19:57:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:57:25] so let's first merge on puppetmaster and see [19:57:56] FIRING: [32x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:58:02] sukhe: merge done. 
I already kicked off a puppet run on 'A:dnsbox', sounds like that wasn't necessary though [19:58:13] yeah, that should not affect anything there but no worries [19:58:32] sukhe@cumin1002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/discovery/wdqs-scholarly [19:58:35] /conftool/v1/discovery/wdqs-scholarly/eqiad [19:58:38] /conftool/v1/discovery/wdqs-scholarly/codfw [19:58:40] looks good [19:58:44] excellent [19:59:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2041.codfw.wmnet with OS bullseye [19:59:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2000). [20:00:04] RoanKattouw, dbrant, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] I'll deploy today [20:00:21] o/ [20:02:56] RESOLVED: [32x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-wdqs-main.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:03:26] ryankemper: resolved [20:03:56] sukhe: thanks for all your help! 
(and the rest of the traffic team) that should be it for us today [20:04:09] hth :) (thanks to brett) [20:04:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant) [20:04:42] yes ty brett! [20:04:51] technically waiting on puppet runs on the cp* servers to do a proper end to end test, but besides that we're all done here [20:05:01] (03Merged) 10jenkins-bot: Turn account vanishing contact form into a redirect. (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065201 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant) [20:05:09] yeah it's a good idea to do that [20:05:10] !log run homer to add wikikube-worker2041 T372878 [20:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:14] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [20:05:51] dbrant: Yours is done, beta might take some time to update though (shouldn't be longer than 10-15 minutes) [20:05:51] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2041.codfw.wmnet [20:05:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2041.codfw.wmnet [20:06:08] cscott: Are you here for your ParserMigration deployment? [20:06:14] thx! [20:06:59] I'm here to help test the Chart extension on beta [20:07:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67827 and previous config saved to /var/cache/conftool/dbconfig/20240826-200701-ladsgroup.json [20:08:09] I'll just proceed with the Chart change for now. 
It'll take a while to deploy in production due to the i18n rebuild that will be required [20:08:18] (03PS8) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [20:08:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [20:08:33] (03CR) 10Bking: [V:03+1] "Thanks for catching this! I was wondering why it didn't go off ;)" [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi) [20:08:36] (03CR) 10Bking: [V:03+1 C:03+2] data-platform: fix deploy tags for stat_host [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi) [20:09:03] (03Merged) 10jenkins-bot: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [20:09:12] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]] [20:09:19] T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945 [20:09:47] (03Merged) 10jenkins-bot: data-platform: fix deploy tags for stat_host [alerts] - 10https://gerrit.wikimedia.org/r/1066661 (https://phabricator.wikimedia.org/T373046) (owner: 10Filippo Giunchedi) [20:19:17] (03CR) 10Dzahn: [C:03+2] gerrit/prometheus: create profile for new nft throttling exporter [puppet] - 10https://gerrit.wikimedia.org/r/1066834 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [20:20:51] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly [20:21:02] !log ryankemper@cumin2002 conftool action : set/pooled=true; 
selector: dnsdisc=wdqs-main [20:22:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67829 and previous config saved to /var/cache/conftool/dbconfig/20240826-202208-ladsgroup.json [20:24:40] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:24:59] Original exception: [ZszkepHm38NxrrUhj5S1oQAAAAM] /wiki/Special:Version UnexpectedValueException: Error: invalid magic word 'chart' (do we need to wait a bit for i18n stuff)? [20:25:00] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10094052 (10KFrancis) Hello @NCreasy please send your email address and postal address to kfrancis@wikimedia.org and I'll get the agreement out to you to sign. Thanks! [20:25:39] ok the error is gone [20:26:48] Yeah beta is still mid-deployment [20:27:59] !log catrope@deploy1003 catrope: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:04] T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945 [20:28:57] !log catrope@deploy1003 catrope: Continuing with sync [20:31:58] aude: Alright beta deployment is done, we can start testing now [20:32:11] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:32:21] I'll start by setting up a data page and a chart page on beta commons [20:33:30] Oh you beat me to it lol [20:34:50] Looks like it's all working! 
[20:36:56] dbrant: Your patch should now be in beta too, my charts deployment delayed yours a bit [20:37:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T370903)', diff saved to https://phabricator.wikimedia.org/P67831 and previous config saved to /var/cache/conftool/dbconfig/20240826-203715-ladsgroup.json [20:37:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:37:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:37:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:37:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67832 and previous config saved to /var/cache/conftool/dbconfig/20240826-203726-ladsgroup.json [20:37:47] RoanKattouw: looks good, thanks! [20:39:10] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1055984|Add Chart extension, enable in beta cluster (T369945)]] (duration: 29m 57s) [20:39:13] T369945: Epic: Deploy Chart extension on beta cluster - https://phabricator.wikimedia.org/T369945 [20:39:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67833 and previous config saved to /var/cache/conftool/dbconfig/20240826-203936-ladsgroup.json [20:43:00] RoanKattouw: sorry I spaced on the timing, but if you're still around I'm game to deploy my patch. [20:43:39] (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) [20:43:42] Starting it now [20:43:47] (03PS3) 10C. 
Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) [20:43:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:44:38] (03Merged) 10jenkins-bot: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:44:49] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1064963|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]] [20:44:53] T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789 [20:47:42] !log catrope@deploy1003 catrope, cscott: Backport for [[gerrit:1064963|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:51:24] (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [20:51:30] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2025.codfw.wmnet [20:52:05] (03CR) 10Andrew Bogott: [C:03+2] Remove obsolete files for openstack v. antelope [puppet] - 10https://gerrit.wikimedia.org/r/1065235 (owner: 10Andrew Bogott) [20:52:06] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2025.codfw.wmnet [20:52:38] cscott: Is this testable on the test servers, or should I just proceed? [20:54:04] (03CR) 10Scott French: "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [20:54:33] (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes2025 as wikikube-worker2042 [puppet] - 10https://gerrit.wikimedia.org/r/1066878 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [20:54:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67834 and previous config saved to /var/cache/conftool/dbconfig/20240826-205443-ladsgroup.json [20:54:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10094161 (10jwang) Hi @ssingh, my manager @mpopov is on PTO in the following two weeks. Can I ask his manager for approval, or is there someone else I should ask? [20:55:27] RoanKattouw: it's testable, give me a second. [20:55:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jwang - https://phabricator.wikimedia.org/T373379#10094163 (10Reedy) [20:56:52] oh crap, some of the CSS needed isn't going to be in place until Wednesday's train :( [20:57:14] https://en.wikivoyage.org/wiki/Coimbra the parsoid indicator top right is floating too far above the baseline :( [20:57:40] i forgot i needed that deployed [20:58:02] RoanKattouw: i'm afraid we should probably back that out, and i'll redo it on Wednesday after the train. [20:58:17] OK will do [20:58:19] !log catrope@deploy1003 Sync cancelled. [20:58:28] or else I could backport the needed CSS but it's too late in the window for that i think. [20:58:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094168 (10jwang) [20:58:42] RoanKattouw: sorry about that. 
[20:59:10] (03PS1) 10TrainBranchBot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881 [20:59:10] (03CR) 10TrainBranchBot: "catrope@deploy1003 created a revert of this change as I4fbffad1102c3290c98bdfa355c5b412a473cf3f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:59:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881 (owner: 10TrainBranchBot) [21:00:02] (03Merged) 10jenkins-bot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066881 (owner: 10TrainBranchBot) [21:00:04] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2100). [21:00:28] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] [21:00:41] (03PS1) 10C. 
Scott Ananian: Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) [21:01:51] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2025 to wikikube-worker2042 [21:02:14] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:02:30] !log catrope@deploy1003 catrope, trainbranchbot: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:04:16] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094179 (10ssingh) >>! In T373379#10094161, @jwang wrote: > Hi @ssingh, my manager @mpopov is on PTO in the following two weeks. Can I ask his manager for approval,... [21:04:37] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10094180 (10KFrancis) Hello all, I am confirming as @NCreasy is a contractor with the WMF, there is already an NDA in place. Thanks! 
[21:05:11] (03Abandoned) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [21:07:31] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2025 to wikikube-worker2042 - swfrench@cumin2002" [21:08:29] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2025 to wikikube-worker2042 - swfrench@cumin2002" [21:08:29] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:08:31] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2042 [21:08:57] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2042 [21:09:37] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2025 to wikikube-worker2042 [21:09:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67835 and previous config saved to /var/cache/conftool/dbconfig/20240826-210951-ladsgroup.json [21:09:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094187 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes... 
[21:16:27] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2042.codfw.wmnet on all recursors [21:16:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2042.codfw.wmnet on all recursors [21:17:11] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2042.codfw.wmnet with OS bullseye [21:17:23] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host [21:17:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094193 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... [21:18:33] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:22:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094199 (10jwang) @kzimmerman, can you approve it while @mpopov is out? 
[21:23:02] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2042 - swfrench@cumin2002" [21:23:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2042 - swfrench@cumin2002" [21:23:08] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:23:08] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2042.codfw.wmnet 20.0.192.10.in-addr.arpa 0.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:23:11] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2042.codfw.wmnet 20.0.192.10.in-addr.arpa 0.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:23:12] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2042 [21:24:24] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:24:26] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2042 [21:24:26] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [21:24:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:25:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T370903)', diff saved to https://phabricator.wikimedia.org/P67836 and previous config saved to /var/cache/conftool/dbconfig/20240826-212458-ladsgroup.json [21:25:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime 
for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [21:25:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [21:25:07] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:25:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T370903)', diff saved to https://phabricator.wikimedia.org/P67837 and previous config saved to /var/cache/conftool/dbconfig/20240826-212513-ladsgroup.json [21:25:14] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52483 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:25:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:19] !log catrope@deploy1003 catrope, trainbranchbot: Continuing with sync [21:27:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T370903)', diff saved to https://phabricator.wikimedia.org/P67838 and previous config saved to /var/cache/conftool/dbconfig/20240826-212723-ladsgroup.json [21:30:30] (03PS1) 10Bartosz Dziewoński: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 [21:31:50] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1066881|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] (duration: 31m 21s) [21:38:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T371742)', diff saved to https://phabricator.wikimedia.org/P67839 and previous config saved to /var/cache/conftool/dbconfig/20240826-213807-ladsgroup.json [21:38:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - 
https://phabricator.wikimedia.org/T371742 [21:41:37] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage [21:42:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67840 and previous config saved to /var/cache/conftool/dbconfig/20240826-214230-ladsgroup.json [21:44:29] (03CR) 10JHathaway: "@ffurnari@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway) [21:44:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10094270 (10kzimmerman) Approved as Mikhail's manager! (Mikhail has mentioned the needs to deploy Airflow pipelines. Let me know if other questions come up that I c... [21:45:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage [21:45:41] (03CR) 10JHathaway: "though pcc shows the metaparams being removed, in my testing the metaparams are still taken into account when applying, they are just no l" [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway) [21:47:56] (03PS1) 10Bartosz Dziewoński: logging: Use '??=' operator to reduce repetition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 [21:49:07] (03PS1) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T366431) [21:49:29] (03PS2) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T366431) [21:50:22] (03PS3) 10Pppery: Revert "[svwikt] Add a temporary logo for the 
100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) [21:53:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67841 and previous config saved to /var/cache/conftool/dbconfig/20240826-215314-ladsgroup.json [21:57:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67842 and previous config saved to /var/cache/conftool/dbconfig/20240826-215738-ladsgroup.json [22:00:13] (03PS2) 10Superpes15: [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) [22:00:21] jouncebot: nowandnext [22:00:21] For the next 0 hour(s) and 59 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240826T2100) [22:00:21] In 3 hour(s) and 59 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0200) [22:00:44] (03CR) 10Zabe: [C:03+2] [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) (owner: 10Superpes15) [22:00:59] (03PS3) 10Zabe: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15) [22:01:09] !log bking@dns1004.wikimedia.org `sudo -i authdns-update` T364364 [22:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:16] T364364: Provision DNS and certificates for wdqs graph split domains - https://phabricator.wikimedia.org/T364364 [22:01:29] (03Merged) 10jenkins-bot: [arbcom_itwiki] Enable importing from itwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) (owner: 10Superpes15) [22:01:36] (03PS4) 10Zabe: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15) [22:01:38] (03CR) 10Zabe: [C:03+2] [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15) [22:02:27] (03Merged) 10jenkins-bot: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15) [22:02:44] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1051757|[sysop_plwiki] Change the logo/icon and the favicon (T368712)]], [[gerrit:1052063|[arbcom_itwiki] Enable importing from itwiki (T369264)]] [22:02:50] T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712 [22:02:50] T369264: Enable importing from itwiki on arbcom_itwiki - https://phabricator.wikimedia.org/T369264 [22:04:48] !log zabe@deploy1003 superpes, zabe: Backport for [[gerrit:1051757|[sysop_plwiki] Change the logo/icon and the favicon (T368712)]], [[gerrit:1052063|[arbcom_itwiki] Enable importing from itwiki (T369264)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:05:21] !log zabe@deploy1003 superpes, zabe: Continuing with sync [22:05:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2042.codfw.wmnet with OS bullseye [22:05:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094339 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
swfrench@cumin2002 for host wikik... [22:07:14] (03PS6) 10Superpes15: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) [22:07:15] (03CR) 10Zabe: [C:03+2] Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) (owner: 10Superpes15) [22:07:58] (03Merged) 10jenkins-bot: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) (owner: 10Superpes15) [22:08:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67843 and previous config saved to /var/cache/conftool/dbconfig/20240826-220821-ladsgroup.json [22:30:46] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:31:52] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 473, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:33:40] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:33:40] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:33:40] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:30] !log running homer 'cr*codfw*' commit 'T372878' (remove old BGP session config for kubernetes2018, kubernetes2025) [22:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:34] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [22:36:40] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:36:40] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:38:05] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:00] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 555, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:39:26] (PS1) Jasmine_: admin: renamed jfk to jasmine [puppet] - https://gerrit.wikimedia.org/r/1066909 [22:39:40] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:40:20] (CR) CI reject: [V:-1] admin: renamed jfk to jasmine [puppet] - https://gerrit.wikimedia.org/r/1066909 (owner: Jasmine_) [22:41:09] this is a fun one. what changed [22:41:41] sukhe: the ntp alerts, or the BGP ones that just resolved? [22:42:18] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:42:51] NTP ones, looking [22:44:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P67850 and previous config saved to /var/cache/conftool/dbconfig/20240826-224426-ladsgroup.json [22:48:41] ah I see it [22:48:51] the alert hosts changed [22:49:00] the other question I have is why didn't this alert before [22:49:10] but that's for later I guess, we should restart ntp.service.
running the cookbook [22:49:33] (PS2) Jasmine_: admin: renamed jfk to jasmine [puppet] - https://gerrit.wikimedia.org/r/1066909 [22:51:08] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox [22:51:26] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:51:34] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [22:51:58] ok recoveries should come but this takes some time as we have a grace sleep of 15 minutes between each hosts for the NTP sync [22:52:11] nothing to worry here as such and should not affect anything else [22:52:29] thanks, sukhe!
[22:52:34] thanks <3 [22:52:38] (CR) RLazarus: [C:+2] admin: renamed jfk to jasmine [puppet] - https://gerrit.wikimedia.org/r/1066909 (owner: Jasmine_) [22:59:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T370903)', diff saved to https://phabricator.wikimedia.org/P67851 and previous config saved to /var/cache/conftool/dbconfig/20240826-225933-ladsgroup.json [22:59:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:00:26] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:01:04] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:01:44] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:02:45] ^ should clear up as the cookbook progresses [23:06:54] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:07:22] PROBLEM - Host kubernetes2018 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:16] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h).
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:23:42] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:31:10] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:39:18] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:54:46] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [23:57:11] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections