[00:00:09] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:03:39] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:04:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073911 (owner: 10TrainBranchBot) [00:10:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159232 (10phaultfinder) [00:11:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:14:05] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:14:07] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:16:25] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:25] PROBLEM - Check unit status of 
sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:17:27] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:20:45] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:21:25] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:07] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:26:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:27:25] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:27:27] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:31:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:35:07] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:36:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:25] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:38:27] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:41:03] !log force-reboot of puppetserver1001 via ipmitool (unresponsive for over 30m) [00:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:41:56] swfrench-wmf: thanks! 
might need to kick start the service as well since it doesn't have network-online.target [which we should add, which I thought I will open a task last time but then I forgot :] [00:42:31] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:43:21] sukhe: ack, thank you! just to confirm, you mean systemctl start for puppet server itself? [00:43:25] yep [00:43:36] the sync-puppet-volatile service [00:43:51] in fact, I can run up and do it [00:45:04] !log sukhe@puppetserver1002:~$ sudo systemctl start sync-puppet-volatile.service [00:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:07] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:45:16] ah, thanks, sukhe! [00:46:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:37] !log sudo cumin 'puppetserver1003* or puppetserver2003*' 'systemctl start sync-puppet-volatile.service' [00:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:46] (03PS6) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [00:47:01] (03CR) 10CI reject: [V:04-1] Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 (owner: 10Jdlrobson) [00:48:25] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:48:27] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:48:43] sukhe: alright, now I see what you meant - thank you again! [00:49:47] swfrench-wmf: no worries, thanks for fixing puppetserver otherwise we would have failed runs overnight [00:51:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:52:33] hm, do we need to bounce sync-puppet-volatile on puppetserver1002, 2002 also? [00:52:59] looks good there, not sure why the recover hasn't come in [00:53:02] 1002 at least [00:53:26] oh sorry I missed your first log long [00:53:29] log line [00:54:01] interestingly enough, 1002 says [00:54:02] > puppetserver needs restarting check /run/puppetserver/restart_required [00:54:05] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:54:09] Restart required from Wed Sep 18 01:10:08 PM UTC 2024 [00:54:47] I am not sure what to make of it or what the steps are so I won't touch it :] [00:55:09] never heard of that! 
yeah sounds like a plan [00:55:13] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159287 (10phaultfinder) [00:56:25] RESOLVED: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:00:45] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:14:24] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10159305 (10Papaul) @Dwisehaupt we will take care of it tomorrow [01:21:38] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:24:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns name for frack new switches - pt1979@cumin2002" [01:25:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns name for frack new switches - pt1979@cumin2002" [01:25:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:49:49] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10159341 (10ssingh) @RobH: ` sukhe@cumin1002:~$ sudo cumin 'A:cp' 'dmesg -T | grep -q -i "core temperature is above" 
&& echo "CPU throttled due to high temperature" || echo "CPU is OK"' 112 hosts... [01:53:11] PROBLEM - Disk space on seaborgium is CRITICAL: DISK CRITICAL - free space: / 634 MB (3% inode=92%): /tmp 634 MB (3% inode=92%): /var/tmp 634 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [02:38:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:03:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:22] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159383 (10phaultfinder) 
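Note on the thermal check quoted from T374986 above: in a `grep -q ... && echo ... || echo ...` chain, the second `echo` runs whenever either earlier command fails, so an explicit if/else is a slightly safer way to express the same test. A minimal sketch, assuming kernel log text arrives on stdin; the function name `check_cpu_throttle` is hypothetical and not taken from the task:

```shell
#!/bin/sh
# Hypothetical helper: classify kernel log text read from stdin.
# The match string and both messages are copied from the command in the log.
check_cpu_throttle() {
    if grep -q -i "core temperature is above"; then
        echo "CPU throttled due to high temperature"
    else
        echo "CPU is OK"
    fi
}

# Usage on a host (assumes permission to read the kernel log):
#   dmesg -T | check_cpu_throttle
```

Run through cumin, the function body would replace the quoted one-liner; the per-host output stays the same two strings.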
[03:28:42] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10159384 (10Papaul) 05Open→03Resolved switch interface clean up done ` papaul@fasw-c-codfw# run show interfaces descriptions ge-[0-1... [03:29:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes1056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:31:33] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [03:43:25] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:46:25] FIRING: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:48:27] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [03:50:15] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159398 (10phaultfinder) [04:04:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:16] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159399 (10phaultfinder) [04:09:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:13] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151 (10Papaul) 03NEW [04:18:37] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159412 (10Papaul) [04:23:29] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159413 (10Papaul) @Jhancock.wm if you have some time this week or next week can you please check in rack C8 all the servers that have only 1G network card and... [04:23:36] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159414 (10Papaul) p:05Triage→03Medium [04:24:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159415 (10Papaul) [04:49:28] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159423 (10Papaul) [04:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:02:48] FIRING: PuppetFailure: Puppet has failed on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:30:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 
2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:53] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:55:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159450 (10phaultfinder) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T0600) [06:00:04] marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T0600). 
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69312 and previous config saved to /var/cache/conftool/dbconfig/20240919-060510-arnaudb.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:55] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:20:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 10%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69313 and previous config saved to /var/cache/conftool/dbconfig/20240919-062016-arnaudb.json [06:35:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 15%: post maintenance', diff 
saved to https://phabricator.wikimedia.org/P69314 and previous config saved to /var/cache/conftool/dbconfig/20240919-063521-arnaudb.json [06:41:15] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159452 (ABran-WMF) all actionable machines are ready to be depooled. I'll start depooling 20/15min before 16:00 UTC [06:47:47] log cleanup some old Bacula restores (4G) on seaborgium [06:47:49] !log cleanup some old Bacula restores (4G) on seaborgium [06:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69315 and previous config saved to /var/cache/conftool/dbconfig/20240919-065026-arnaudb.json [06:53:06] !log adding Tiziano to pwstore [06:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:11] RECOVERY - Disk space on seaborgium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:04:02] easy [07:05:31] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10159512 (ArthurTaylor) Are there any additional steps I need to take here...
[07:05:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69316 and previous config saved to /var/cache/conftool/dbconfig/20240919-070532-arnaudb.json [07:11:33] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.008 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [07:12:48] RESOLVED: PuppetFailure: Puppet has failed on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:16:25] FIRING: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:01] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 16509 [07:20:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69317 and previous config saved to /var/cache/conftool/dbconfig/20240919-072037-arnaudb.json [07:21:14] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10159527 (10hashar) @Ladsgroup Arthur already has shell access to production... 
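The db2229 repool entries in this log step up in pool percentage at roughly 15-minute intervals. A rough sketch for pulling those steps out of a saved copy of the log, assuming GNU grep (for `-o`) and the exact `dbctl commit` wording seen above; `repool_steps` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print one line per
# repool step in the form "HH:MM:SS host percent".
# Assumes the "dbctl commit (...): 'HOST (re)pooling @ N%" wording from the log.
repool_steps() {
    grep -o "\[[0-9:]*] !log [^ ]* dbctl commit ([^)]*): '[^ ]* (re)pooling @ [0-9]*%" |
        sed "s/^\[\([0-9:]*\)].*'\([^ ]*\) (re)pooling @ \([0-9]*%\)$/\1 \2 \3/"
}

# Usage (the file name is a placeholder):
#   repool_steps < operations.log
```

Because `grep -o` emits each match on its own line, this works even when several log entries share one physical line, as they do in this transcript.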
[07:24:49] (CR) Elukey: [C:+1] "It looks good, I am a bit unsure about the move from mac_address defaulting to None to str: '', but if you tried the script and everything" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067960 (owner: Ayounsi) [07:26:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:26:58] SRE, Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10159532 (elukey) puppetserver1001 is also working with the new settings, it was rebooted today after thrashing. [07:27:09] SRE, collaboration-services, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10159535 (hashar) >>! In T177826#10049124, @Dzahn wrote: > Turns out there is another jenkins SSH key h...
[07:35:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P69318 and previous config saved to /var/cache/conftool/dbconfig/20240919-073543-arnaudb.json [07:37:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:43] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 16509 [07:39:35] (03PS1) 10Muehlenhoff: Remove obsolete entries in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1074089 (https://phabricator.wikimedia.org/T355653) [07:40:41] (03CR) 10Ayounsi: [C:03+2] ProvisionServer: add types [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067960 (owner: 10Ayounsi) [07:42:32] (03Merged) 10jenkins-bot: ProvisionServer: add types [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067960 (owner: 10Ayounsi) [07:43:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16509 [07:45:58] (03PS7) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [07:48:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [07:49:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159580 (10ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs [07:52:20] 
(03PS1) 10DCausse: cirrus-streaming-update: enable calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) [07:52:22] (03PS1) 10DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) [07:56:07] (03PS2) 10DCausse: cirrus-streaming-update: enable calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) [07:56:07] (03PS2) 10DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) [07:58:07] (03CR) 10Elukey: [C:03+1] Revert "Remove puppetmaster1003 from active Puppet 5 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1073860 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [07:58:13] (03CR) 10Elukey: [C:03+1] Remove obsolete entries in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1074089 (https://phabricator.wikimedia.org/T355653) (owner: 10Muehlenhoff) [07:58:32] (03CR) 10Filippo Giunchedi: [C:03+2] hiera: set cluster for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1073776 (https://phabricator.wikimedia.org/T375066) (owner: 10Filippo Giunchedi) [07:59:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [08:00:04] jnuche and dduvall: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T0800) [08:00:24] morning, I will roll forward the train in a bit [08:00:46] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [08:01:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:03] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:57] ah the train, I forgot about that :) [08:02:39] we went to schedule switching the CI Jenkins to Java 17 this morning in half an hour. I guess we will sync up with you jnuche :) [08:03:57] hashar: ack, I'm starting the rollout now [08:04:20] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074093 (https://phabricator.wikimedia.org/T373642) [08:04:22] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074093 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [08:04:32] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [08:04:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [08:05:06] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074093 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [08:06:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:05] jnuche: o/ is it possible to wait to rollout the train? Like ~30 mins [08:07:12] (03CR) 10Muehlenhoff: [C:03+2] Point puppet_merge_server to puppetserver1001 [puppet] - 10https://gerrit.wikimedia.org/r/1073451 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [08:07:21] we are moving puppet-merge to a new server, it shouldn't impact but better be safe [08:07:28] elukey: sry, I already kicked it off [08:07:30] we have to upgrade the CI Jenkins as well [08:07:31] :( [08:07:39] (03CR) 10Filippo Giunchedi: "Will need rebasing, though LGTM as first patch to merge when the decom time comes" [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [08:07:43] okok [08:07:49] it shouldn't be a big issue [08:08:11] I guess next time schedule it on https://wikitech.wikimedia.org/wiki/Deployments to avoid a conflict? 
:) [08:08:16] but yeah I guess it will be fine [08:08:29] worst case, we get some puppet-related notifications here [08:08:40] while the train is monitored elsewhere [08:08:44] should be fine :] [08:09:41] makes sense, yes, we didn't think about it when scheduling [08:09:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:38] (PS2) Muehlenhoff: Remove obsolete entries in site.pp [puppet] - https://gerrit.wikimedia.org/r/1074089 (https://phabricator.wikimedia.org/T355653) [08:11:30] (CR) Muehlenhoff: [C:+2] Remove obsolete entries in site.pp [puppet] - https://gerrit.wikimedia.org/r/1074089 (https://phabricator.wikimedia.org/T355653) (owner: Muehlenhoff) [08:16:03] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:57] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.23 refs T373642 [08:17:01] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642 [08:19:25] wmf.23 synced, give me a few mins to verify everything looks healthy [08:20:48] (PS1) Volans: CHANGELOG: add changelogs for release v1.2.7 [software/pywmflib] - https://gerrit.wikimedia.org/r/1074097 [08:24:02] (PS1) Brouberol: Release new base.helper and base.meta mopdules [deployment-charts] - https://gerrit.wikimedia.org/r/1074098 [08:24:03] (PS1) Brouberol: Compute stable secret checksums 
by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074099 [08:24:43] (03PS2) 10Brouberol: Release new base.helper and base.meta modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074098 [08:27:56] hashar, elukey: things look stable, I'm done with the train for now [08:28:05] (03PS3) 10Brouberol: Compute stable secret checksums by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074099 [08:28:08] cool, let's do the CI Jenkins [08:28:10] hashar: I hope everything goes well with the Jenkins update! [08:28:50] eoghan: we can do the java 17 upgrade https://gerrit.wikimedia.org/r/c/operations/puppet/+/1069327/ :) [08:29:23] elukey: Do you want me to hold off merging puppet changes for the moment until you're done? [08:29:39] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:30:00] (03PS4) 10Brouberol: Compute stable secret checksums by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074099 [08:30:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [08:30:13] oh the puppet master is switched so we can't puppet-merge, correct? [08:30:14] :) [08:31:21] the new host to use for puppet-merge is puppetserver1001 from now on, I am syncing with Moritz to see if everything is done and we can let people use it [08:31:34] <_joe_> !log deployed conftool 3.2.4 T375059 T373449 [08:31:39] eoghan: go ahead! You'll be the first to use it other than me and Moritz :) [08:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:40] T375059: Requestctl sync writes unchanged objects - https://phabricator.wikimedia.org/T375059 [08:31:41] T373449: Extract an api class for requestctl - https://phabricator.wikimedia.org/T373449 [08:31:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [08:32:04] elukey: I feel honoured. 
Assuming my muscle memory doesn't kick in between now and when I go to merge... [08:32:12] hashar: Ok, landing that first change now. [08:32:27] eoghan: :D lemme know how it goes [08:32:31] (03CR) 10EoghanGaffney: [C:03+2] contint: switch Jenkins to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [08:33:15] (03CR) 10Brouberol: airflow: allow the webserver and scheduler to be selectively deployed (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [08:33:48] (03PS1) 10Ayounsi: Remove Netbox 3 backward compatibility [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074100 [08:34:03] elukey: All fine. Even had that new server smell. [08:34:26] \o/ [08:34:31] :) [08:36:30] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.2.7 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1074097 (owner: 10Volans) [08:37:05] hashar: Puppet dry run looks good, going to run it for real now [08:39:28] Sep 19 08:37:26 contint1002 systemd[1]: jenkins.service: Current command vanished from the unit file, execution of the command list won't be resumed. [08:39:29] :) [08:39:45] (03CR) 10JMeybohm: [C:03+2] Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:40:14] hashar: I guess that's true. How's jenkins looking? [08:40:22] all fine [08:40:27] gonna restart it with java 17 [08:40:36] `java --version` is 17, so the default is updated correctly. [08:40:42] the systemd service is explicitly NOT managed by Puppet [08:41:51] and it does not use `java` [08:41:55] but an explicit path ExecStart=/usr/lib/jvm/java-17-openjdk-amd64/bin/java [08:41:56] :) [08:42:00] why? 
well I don't know :) [08:42:29] I think I have made it this way to avoid a magic upgrade of java [08:42:44] I am waiting for https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php74/18241/console to complete [08:42:52] and will restart the controller once that job has completed [08:42:56] Yep yep [08:43:18] (03PS1) 10Elukey: sre.hosts.decommission: update location of puppet private [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [08:43:47] once the controller comes back, it would ssh to contint1002 / contint2002 and launch the agent there using `/usr/bin/java` [08:43:53] and we would have fully switched to Java 17 \o/ [08:43:58] Great. [08:44:14] (03PS2) 10Elukey: sre.hosts.decommission: update location of puppet private [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [08:44:14] which is when we will find out something is terribly broken in the stack somewhere [08:44:20] I'll run the change on contint2002 now [08:44:40] (03PS1) 10Joal: EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) [08:45:15] (03CR) 10Volans: sre.hosts.decommission: update location of puppet private (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [08:45:32] !log Restarting CI Jenkins with Java 17 # T359795 [08:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:36] T359795: Switch Jenkins instances from Java 11 to Java 17 - https://phabricator.wikimedia.org/T359795 [08:45:37] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the follow up" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074100 (owner: 10Ayounsi) [08:45:58] (03CR) 10CI reject: [V:04-1] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [08:46:07] Sep 19 08:45:45 contint1002 systemd[1]: Started Jenkins Continuous Integration Server. [08:46:40] (03PS3) 10Elukey: sre.hosts.decommission: update location of puppet private [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [08:46:47] (03CR) 10Elukey: sre.hosts.decommission: update location of puppet private (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [08:48:23] lrwxrwxrwx 1 jenkins-agent jenkins-agent 0 Sep 19 08:45 /proc/2991084/exe -> /usr/lib/jvm/java-17-openjdk-amd64/bin/java [08:48:41] eoghan: so yeah success thank you! I will monitor it over the day [08:48:48] Wonderful [08:48:54] (03CR) 10Ayounsi: [C:03+2] Remove Netbox 3 backward compatibility [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074100 (owner: 10Ayounsi) [08:48:54] and later we can merge the follow up change to remove jdk 11 (and manually purge the packages) [08:49:03] I'll leave things as they are today and then tomorrow morning I'll remove java11. [08:49:03] thank you! 
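Editorial aside: the `/proc/2991084/exe` listing above is how the switch to Java 17 was confirmed for the running agent process. A minimal Python sketch of the same check follows, assuming a Linux host with `/proc` mounted. The helper names `resolve_exe` and `runs_from_jvm` are hypothetical and are not part of any tooling mentioned in this log.

```python
import os


def resolve_exe(pid):
    """Return the binary a running process was started from (Linux /proc only)."""
    # /proc/<pid>/exe is a symlink to the executable. Reading it usually
    # requires being the process owner or root.
    return os.readlink(f"/proc/{pid}/exe")


def runs_from_jvm(exe_path, jvm_home):
    """Return True if exe_path lies inside the given JVM installation directory."""
    exe = os.path.normpath(exe_path)
    home = os.path.normpath(jvm_home)
    # commonpath compares whole path components, so java-11 and java-17
    # directories are never confused by a shared string prefix.
    return os.path.commonpath([exe, home]) == home


if __name__ == "__main__":
    # Path copied from the log line above; on a live host one would pass
    # resolve_exe(<pid>) instead of this literal.
    seen = "/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
    print(runs_from_jvm(seen, "/usr/lib/jvm/java-17-openjdk-amd64"))
```

Comparing the resolved path rather than the output of `java --version` matters here because, as noted in the conversation, the Jenkins unit uses an explicit `ExecStart` path and not the system default `java`.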
[08:49:09] yeah sounds good [08:50:29] (03CR) 10Ayounsi: [C:03+2] Enable sftp-server [homer/public] - 10https://gerrit.wikimedia.org/r/947715 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [08:50:58] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1246.eqiad.wmnet with OS bookworm [08:51:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [08:51:27] (03PS2) 10Joal: EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) [08:51:45] (03CR) 10Volans: "sorry spotted another detail" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [08:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:56:32] (03Abandoned) 10Ayounsi: DHCP: add option 97 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/777799 (https://phabricator.wikimedia.org/T304677) (owner: 10Ayounsi) [08:56:38] (03Abandoned) 10Ayounsi: DHCP: use option 97 by default [cookbooks] - 10https://gerrit.wikimedia.org/r/777805 (https://phabricator.wikimedia.org/T304677) (owner: 10Ayounsi) [08:59:08] (03PS4) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [08:59:13] (03CR) 10CI reject: [V:04-1] sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [08:59:16] (03CR) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [09:00:08] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.7 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1074097 (owner: 10Volans) [09:00:08] (03Merged) 10jenkins-bot: Remove Netbox 3 backward compatibility [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074100 (owner: 10Ayounsi) [09:00:09] (03Merged) 10jenkins-bot: Enable sftp-server [homer/public] - 10https://gerrit.wikimedia.org/r/947715 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [09:03:00] (03CR) 10DCausse: EventStreamConfig: Disable regex steam hadoop ingestion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [09:04:04] (03PS1) 10Volans: Upstream release v1.2.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1074110 [09:10:00] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1246.eqiad.wmnet with OS bookworm [09:10:16] (03PS1) 10Volans: mysql_legacy: reorder CORE_SECTIONS constant [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 [09:10:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [09:10:39] (03CR) 10Cathal Mooney: [C:03+1] "Apologies for the delay somehow missed this one. LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [09:11:26] (03PS1) 10Hashar: jenkins: print stacktraces to logs [puppet] - 10https://gerrit.wikimedia.org/r/1074112 [09:11:37] (03Abandoned) 10Cathal Mooney: Add new BGP group for cross-rack PyBal peerings at L3 POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1020843 (https://phabricator.wikimedia.org/T362772) (owner: 10Cathal Mooney) [09:11:46] 06SRE, 10Observability-Metrics: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10159827 (10fgiunchedi) [09:11:51] (03CR) 10Hashar: "The same we did on https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/55" [puppet] - 10https://gerrit.wikimedia.org/r/1074112 (owner: 10Hashar) [09:12:17] jnuche: can you double check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074112 please ? :) [09:12:33] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10159801 (10ovasileva) [09:12:37] (03CR) 10CI reject: [V:04-1] sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [09:12:40] that is for CI Jenkins to log exception stacktraces the same way you did at https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/55 :) [09:12:45] I literally copy-pasted the line [09:12:57] (03PS1) 10JMeybohm: ferm: Allow to specify a different ferm-status command to use [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) [09:13:11] (03PS5) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [09:13:16] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 
(https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:14:33] (03CR) 10Jaime Nuche: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1074112 (owner: 10Hashar) [09:14:51] hashar: looks good! I can't +2 it though [09:14:58] thanks :) [09:15:08] (03CR) 10Elukey: [C:04-1] "Not working yet, need more time to check" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [09:15:17] eoghan: would you mind merging in a logging config change for Jenkins please ? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074112 [09:15:46] while it restarted I spotted a warning which was pretty useless without the stacktrace :] [09:15:49] and the exception message [09:16:25] (03CR) 10EoghanGaffney: [C:03+2] jenkins: print stacktraces to logs [puppet] - 10https://gerrit.wikimedia.org/r/1074112 (owner: 10Hashar) [09:16:41] 06SRE, 10Observability-Metrics: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10159831 (10fgiunchedi) [09:16:41] 🦄 [09:16:43] hashar: Sure. contint* again? [09:16:50] yeah just puppet merge it I guess [09:17:19] I'll restart Jenkins tomorrow morning to have the change applied [09:18:18] (03PS2) 10JMeybohm: ferm: Allow to specify a different ferm-status command to use [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) [09:18:23] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:18:34] hashar: Should have thought about it when reviewing, but just a thought -- update the comment above to include what %6 is. 
[09:18:49] oh [09:19:29] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10159837 (10cmooney) 05Open→03Resolved a:03cmooney [09:20:32] (03CR) 10Volans: [C:03+2] Upstream release v1.2.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1074110 (owner: 10Volans) [09:21:35] (03PS3) 10JMeybohm: ferm: Allow to specify a different ferm-status command to use [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) [09:21:49] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:23:27] !log btullis@cumin1002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [09:23:41] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1246.eqiad.wmnet with OS bookworm [09:24:32] (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2005 [puppet] - 10https://gerrit.wikimedia.org/r/1072695 (https://phabricator.wikimedia.org/T374688) (owner: 10JMeybohm) [09:24:38] (03CR) 10Btullis: [V:03+1 C:03+2] Move the misc_crons dumper role from snapshot1017 to snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: 10Btullis) [09:26:18] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159844 (10phaultfinder) [09:26:21] (03CR) 10Jelto: [V:03+1 C:04-1] "I agree, let's keep the change in Gerrit 
in case the issue happens again. Then we can do more in depth troubleshooting and see if this is " [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [09:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:28:07] (03CR) 10JMeybohm: "Puppet does still reload ferm, even though ferm-status should not have detected a diff. This change is only to make debugging easier/possi" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:30:34] (03CR) 10JMeybohm: [C:03+1] services: update Tegola's Docker image to pick up package upgrades [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [09:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 17.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:33:10] (03CR) 10JMeybohm: [C:03+1] services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:34:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 48.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[09:34:53] (03Merged) 10jenkins-bot: Upstream release v1.2.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1074110 (owner: 10Volans) [09:36:52] (03CR) 10JMeybohm: [C:03+1] ipoid: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:36:57] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:37:00] (03CR) 10JMeybohm: [C:03+1] ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414) (owner: 10Effie Mouzeli) [09:38:10] (03Merged) 10jenkins-bot: ipoid: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:39:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 3s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:39:24] (03PS1) 10Hashar: jenkins: document SimpleFormatter.format arguments [puppet] - 10https://gerrit.wikimedia.org/r/1074121 [09:39:48] (03CR) 10Effie Mouzeli: [C:03+1] wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:40:04] (03CR) 10Hashar: "I have refreshed the list of arguments passed to the log formatter using https://docs.oracle.com/en/java/javase/17/docs/api/java.logging/j" [puppet] - 10https://gerrit.wikimedia.org/r/1074121 (owner: 10Hashar) [09:40:09] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073748 
(https://phabricator.wikimedia.org/T374414) (owner: 10Effie Mouzeli) [09:40:15] eoghan: I have updated the comment with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074121 :) [09:40:30] Wonderful, thanks [09:40:40] (03CR) 10EoghanGaffney: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1074121 (owner: 10Hashar) [09:40:51] (03CR) 10EoghanGaffney: [C:03+2] jenkins: document SimpleFormatter.format arguments [puppet] - 10https://gerrit.wikimedia.org/r/1074121 (owner: 10Hashar) [09:40:54] (03CR) 10JMeybohm: [C:03+2] wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:40:57] \o/ [09:40:58] (03CR) 10JMeybohm: [C:03+2] wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:41:06] (03Merged) 10jenkins-bot: ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414) (owner: 10Effie Mouzeli) [09:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:41:37] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [09:42:12] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [09:42:21] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:42:23] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:44:05] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:44:07] !log 
jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:44:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 33.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:44:51] !log testing purged 0.24 in cp4038 - T334078 [09:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:55] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [09:45:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 8.333% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:45:35] (03CR) 10Btullis: [V:03+1 C:03+2] "For the record, this didn't work perfectly, but it was OK." 
[puppet] - 10https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: 10Btullis) [09:48:29] !log uploaded python3-wmflib_1.2.7 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [09:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:00] (03PS6) 10Brouberol: airflow: allow the scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [09:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 17.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:50:49] (03CR) 10Brouberol: airflow: allow the scheduler to be selectively deployed (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [09:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:52:15] (03PS1) 10Filippo Giunchedi: Remove check_procs alerts for statsd and statsv [puppet] - 10https://gerrit.wikimedia.org/r/1074122 (https://phabricator.wikimedia.org/T357099) [09:53:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:55:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:56:10] (03PS1) 10Filippo Giunchedi: ganeti: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074123 (https://phabricator.wikimedia.org/T357099) [09:56:12] (03PS1) 10Filippo Giunchedi: kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) [09:56:14] (03PS1) 10Filippo Giunchedi: samplicator: remove absented check [puppet] - 10https://gerrit.wikimedia.org/r/1074125 (https://phabricator.wikimedia.org/T357099) [09:57:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:58:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10159915 (10ABran-WMF) I think this could be related to storage as I've tried to reimage the machine without any luck. {F57523367} Using a vmedia I've been able to boot on a distro {F5752... 
[09:58:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 16s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:58:24] (03CR) 10Filippo Giunchedi: [C:03+2] Remove check_procs alerts for statsd and statsv [puppet] - 10https://gerrit.wikimedia.org/r/1074122 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [09:58:58] (03CR) 10Hashar: [C:03+1] "I think that will do it yes :)" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1000) [10:01:47] 06SRE, 10observability, 13Patch-For-Review: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#10159954 (10MoritzMuehlenhoff) [10:02:05] (03CR) 10Hashar: [C:03+1] gerrit::proxy: fix link target for gerrit logo [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [10:02:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:03:07] 06SRE, 06Traffic, 13Patch-For-Review: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506#10159937 (10MoritzMuehlenhoff) 05Open→03Resolved a:03CDobbins @CDobbins FYI, I'm assigning this to you and resolve it, given you've completed all the work [10:04:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Warning re: excessive directory entries on prometheus with puppet7 - https://phabricator.wikimedia.org/T351643#10159958 (10MoritzMuehlenhoff) [10:04:16] 06SRE, 10SRE-tools, 
06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10159959 (10MoritzMuehlenhoff) [10:06:12] 06SRE, 10Cloud-VPS, 10observability, 10Observability-Logging, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10159964 (10fgiunchedi) [10:06:25] 06SRE, 06cloud-services-team, 10Cloud-VPS, 10observability, 13Patch-For-Review: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623#10159965 (10fgiunchedi) [10:07:01] 06SRE, 10Cloud-VPS, 10observability, 10Observability-Logging, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10159968 (10fgiunchedi) [10:07:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#10159969 (10MoritzMuehlenhoff) We have migrated puppet merges to puppetserver1001, so this is not a blocker anymore to the shutdown of the... 
[10:07:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#10159971 (10MoritzMuehlenhoff) [10:07:38] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#10159972 (10MoritzMuehlenhoff) [10:08:39] (03PS1) 10Volans: sre.switchdc.databases.prepare: add check [cookbooks] - 10https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) [10:08:43] (03PS1) 10Volans: sre.switchdc.databases: update Phabricator more [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) [10:10:58] 06SRE, 10Observability-Logging, 13Patch-For-Review: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624#10159987 (10fgiunchedi) [10:11:01] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10159988 (10fgiunchedi) [10:11:45] (03CR) 10Volans: sre.switchdc.databases: update Phabricator more (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [10:12:59] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10159989 (10MoritzMuehlenhoff) >>! In T368023#10126061, @elukey wrote: > Next and last step - wait for the new conftool releas... 
[10:14:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074123 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [10:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:18:13] (03CR) 10Clément Goubert: [C:03+1] deployment servers: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1072744 (owner: 10Muehlenhoff) [10:18:20] (03CR) 10Volans: "Actual order to be decided, this is a conversation starter." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [10:19:00] (03CR) 10Elukey: [C:03+2] services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:20:27] RECOVERY - MegaRAID on analytics1074 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:21:03] (03CR) 10Clément Goubert: [C:03+1] Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [10:21:13] (03PS1) 10Effie Mouzeli: ipoid: update cronjob.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074131 [10:21:57] !log elukey@deploy1003 Started scap sync-world: Remove network policies for old poolcounter nodes. 
[10:22:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10160016 (10MoritzMuehlenhoff) ganeti2018 is drained [10:22:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:22:55] (03CR) 10Ayounsi: [C:03+2] Enable GNMI on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [10:23:01] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [10:23:13] !log rebalance ganeti group C following the various switch maintenances T370630 [10:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:17] T370630: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630 [10:23:31] (03Merged) 10jenkins-bot: Enable GNMI on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [10:24:49] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: update cronjob.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074131 (owner: 10Effie Mouzeli) [10:25:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10160027 (10phaultfinder) [10:25:46] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10160028 (10cmooney) 05Open→03Resolved In the end we got away without needing this, thanks to data-persistence. I'll close for now and we can re-open if... 
[10:25:59] (03Merged) 10jenkins-bot: ipoid: update cronjob.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074131 (owner: 10Effie Mouzeli) [10:26:17] (03PS1) 10Cathal Mooney: Remove IDs for ESI LAGs on codfw spines to row c/d legacy switches [homer/public] - 10https://gerrit.wikimedia.org/r/1074133 (https://phabricator.wikimedia.org/T364095) [10:26:28] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:26:30] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:26:37] !log enable gNMI on cloudsw [10:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:44] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:26:46] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:29:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [10:29:18] (03PS1) 10Brouberol: Create production and staging NS for mw-dump-rev-content-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074135 (https://phabricator.wikimedia.org/T368787) [10:29:41] (03CR) 10Gmodena: [C:03+1] Create production and staging NS for mw-dump-rev-content-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074135 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:30:00] !log elukey@deploy1003 Finished scap sync-world: Remove network policies for old poolcounter nodes. 
(duration: 08m 55s) [10:30:34] (03CR) 10Ayounsi: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1074133 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [10:31:25] (03PS1) 10Brouberol: deployment_server: create prod/staging users for mw-dump-rev-content-reconcile-enrich [puppet] - 10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) [10:31:56] (03CR) 10Gmodena: [C:03+1] deployment_server: create prod/staging users for mw-dump-rev-content-reconcile-enrich [puppet] - 10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:32:02] (03CR) 10CI reject: [V:04-1] deployment_server: create prod/staging users for mw-dump-rev-content-reconcile-enrich [puppet] - 10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:32:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:33:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:33:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:33:30] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4036/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:34:24] (03PS2) 10Brouberol: deployment_server: create mw-dump-rev-content-reconcile-enrich users [puppet] - 10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) [10:35:31] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove puppetmaster1003 from active Puppet 5 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1073860 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [10:36:56] (03CR) 10Brouberol: [C:03+2] deployment_server: create mw-dump-rev-content-reconcile-enrich users [puppet] - 10https://gerrit.wikimedia.org/r/1074136 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:37:27] (03CR) 10Brouberol: [C:03+2] Create production and staging NS for mw-dump-rev-content-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074135 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [10:37:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:37:33] (03PS1) 10Ayounsi: Add cloudsw to gNMIc targets [puppet] - 10https://gerrit.wikimedia.org/r/1074139 [10:38:05] (03PS2) 10Ayounsi: Add cloudsw to gNMIc targets [puppet] - 10https://gerrit.wikimedia.org/r/1074139 [10:38:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:38:21] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074139 (owner: 10Ayounsi) [10:38:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:43:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:46:36] (03PS1) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [10:47:30] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:49:23] (03CR) 10Cathal Mooney: [C:03+1] Add cloudsw to gNMIc targets [puppet] - 10https://gerrit.wikimedia.org/r/1074139 (owner: 10Ayounsi) [10:49:45] (03CR) 10Muehlenhoff: [C:03+1] "Or just leave it in, it might come in handy at a later point as well." [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [10:51:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:51:18] (03PS1) 10Effie Mouzeli: ipoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074141 [10:52:52] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074141 (owner: 10Effie Mouzeli) [10:53:55] (03Merged) 10jenkins-bot: ipoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074141 (owner: 10Effie Mouzeli) [10:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 14.58% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:55:58] !log T375078 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bhwiki --logwiki=metawiki 'SikAnderAhmedas' 'Renamed user ab8e0a47aa0e5d456f28ee3977f8c682' [10:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:02] T375078: Unblock stuck global rename of SikAnderAhmedas - https://phabricator.wikimedia.org/T375078 [10:56:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:56:47] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [10:57:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:59:30] (03PS2) 10Muehlenhoff: Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791 [11:00:26] y [11:01:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:02:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:02:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.083s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:03:36] (03CR) 10Ayounsi: [C:03+2] Add cloudsw to gNMIc targets [puppet] - 10https://gerrit.wikimedia.org/r/1074139 (owner: 10Ayounsi) [11:04:09] (03PS1) 10Btullis: Add dummy secrets for the rclone backup on db1208 [labs/private] - 10https://gerrit.wikimedia.org/r/1074143 (https://phabricator.wikimedia.org/T372908) [11:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:05:05] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:05:58] (03CR) 10Ayounsi: [C:03+1] samplicator: remove absented check [puppet] - 10https://gerrit.wikimedia.org/r/1074125 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [11:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:06:30] (03CR) 10Muehlenhoff: [C:03+2] Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791 (owner: 10Muehlenhoff) [11:07:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions 
(k8s) 7.083s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:09:33] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160179 (10ovasileva) [11:09:36] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160180 (10ovasileva) [11:09:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160181 (10ovasileva) [11:09:52] (03PS2) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:10:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:11:00] (03PS3) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:14:23] PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:06] 
(03PS4) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:15:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:15:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4039/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:20:07] (03PS5) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:20:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:20:52] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4040/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:21:09] (03PS8) 10Gmodena: ds8-k8s-service: add values for dumps2 job. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) [11:21:38] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy secrets for the rclone backup on db1208 [labs/private] - 10https://gerrit.wikimedia.org/r/1074143 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:27:03] (03PS1) 10Muehlenhoff: No longer include config-master on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) [11:27:08] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4041/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:28:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:28:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [11:29:03] (03CR) 10JMeybohm: [C:03+2] ferm: Allow to specify a different ferm-status command to use [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [11:29:14] (03PS6) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:29:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4042/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 
(https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:31:27] (03CR) 10Tacsipacsi: "Done, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:31:37] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [11:31:38] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f1-eqiad [11:32:00] (03PS7) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:32:43] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4043/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:33:02] (03PS1) 10Effie Mouzeli: Revert "ipoid: fix typo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074152 [11:33:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:34:08] !log ayounsi@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [11:34:10] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f1-eqiad [11:38:00] (03CR) 10Effie Mouzeli: [C:03+2] Revert "ipoid: fix typo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074152 (owner: 10Effie Mouzeli) [11:39:28] (03Merged) 10jenkins-bot: Revert "ipoid: fix typo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074152 (owner: 10Effie Mouzeli) [11:39:56] (03PS8) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 
(https://phabricator.wikimedia.org/T372908) [11:40:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4044/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:41:38] (03PS1) 10Effie Mouzeli: ipoid: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074153 [11:43:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:43:59] (03PS9) 10Btullis: Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) [11:44:38] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074153 (owner: 10Effie Mouzeli) [11:44:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4045/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [11:45:48] (03Merged) 10jenkins-bot: ipoid: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074153 (owner: 10Effie Mouzeli) [11:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:50:58] ^ we know [11:52:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated 
globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:53:18] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160354 (10ayounsi) [11:55:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160375 (10cmooney) 05Resolved→03Open [11:56:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:56:39] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160348 (10ayounsi) a:05cmooney→03ayounsi Blocked on {T365012} to be able to renew the certs. Other than that, manually tested and works as exp... [11:56:52] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160350 (10cmooney) 05Open→03Resolved This has been enabled following the cloudsw upgrades. (see https://gerrit.wikimedia.org/r/c/operations/p... 
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1200) [12:00:49] (03PS1) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:01:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:01:28] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:01:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ProofreadPage] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073879 (https://phabricator.wikimedia.org/T375114) (owner: 10Sohom Datta) [12:02:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:03:35] (03CR) 10CI reject: [V:04-1] profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:06:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:07:01] (03PS1) 10Gmodena: config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) [12:07:24] (03PS1) 10Btullis: Configure a bacula fileset and job for db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) [12:08:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4046/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [12:10:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 8.333% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:11:33] (03PS2) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:12:39] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 
10JMeybohm) [12:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 17.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:17:29] (03PS3) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:17:31] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:20:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:22:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:26:14] (03PS4) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:26:23] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:30:59] (03PS5) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:31:14] (03CR) 10JMeybohm: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:31:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10160460 (10VRiley-WMF) a:03VRiley-WMF [12:31:53] (03CR) 10David Caro: "This broke puppet runs on all the cloud VMs:" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:32:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:32:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:45] (03PS1) 10Effie Mouzeli: cronjobs: update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074165 [12:33:15] (03PS1) 10David Caro: firewall: add missing cloud.yaml entry [puppet] - 10https://gerrit.wikimedia.org/r/1074166 [12:34:34] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4047/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074166 (owner: 10David Caro) [12:35:10] (03PS6) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:35:26] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1074167 [12:35:28] (03CR) 10David Caro: "Fixed in 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074166" [puppet] - 10https://gerrit.wikimedia.org/r/1074113 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:35:51] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:35:58] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1074166 (owner: 10David Caro) [12:36:40] (03CR) 10David Caro: [V:03+1 C:03+2] firewall: add missing cloud.yaml entry [puppet] - 10https://gerrit.wikimedia.org/r/1074166 (owner: 10David Caro) [12:37:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:38:44] (03PS7) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:38:59] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:39:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:41:20] (03PS1) 10Effie Mouzeli: cronjobs: update to 3.0.1 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [12:42:18] (03CR) 10CI reject: [V:04-1] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [12:44:51] (03PS2) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [12:46:49] (03CR) 10CI reject: [V:04-1] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [12:49:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:50:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10160506 (10phaultfinder) [12:52:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:53:03] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10160507 (10VRiley-WMF) 05Open→03Resolved Attempted to rebalance power. [12:53:57] 06SRE, 06Traffic: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837#10160509 (10Daimona) [12:53:58] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10160510 (10VRiley-WMF) Is there an acceptable time to swap out the DIMM? We can proceed at any time. 
[12:55:02] (03PS8) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [12:55:46] (03PS3) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [12:56:19] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:56:42] (03CR) 10CI reject: [V:04-1] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [12:59:24] (03PS9) 10JMeybohm: profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1300). [13:00:04] seanleong-wmde and Sohom_Datta: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [13:00:33] o/ [13:00:36] I might be able to deploy in a few minutes [13:00:42] or in 20 minutes. 
we’ll see [13:02:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:03:14] (03CR) 10Slyngshede: [C:03+1] "Looks Good." [dns] - 10https://gerrit.wikimedia.org/r/1074167 (owner: 10Muehlenhoff) [13:04:18] (03CR) 10Brouberol: [C:03+1] Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:04:40] (03CR) 10Btullis: [V:03+1 C:03+2] Render the rclone config file for db1208 postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/1074140 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:05:39] (03PS2) 10Btullis: Configure a bacula fileset and job for db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) [13:07:04] (03CR) 10Brouberol: [C:03+1] Configure a bacula fileset and job for db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:07:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:09:34] (03PS4) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [13:10:38] (03CR) 10CI reject: [V:04-1] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [13:11:09] 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10160587 (10jijiki) Alright, we are looking into it [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST 
certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:48] (03PS2) 10Filippo Giunchedi: samplicator: remove absented check [puppet] - 10https://gerrit.wikimedia.org/r/1074125 (https://phabricator.wikimedia.org/T357099) [13:12:12] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] samplicator: remove absented check [puppet] - 10https://gerrit.wikimedia.org/r/1074125 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:12:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:12:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [13:15:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [13:15:42] (03CR) 10JMeybohm: [C:03+2] profile::firewall: Absent confd config when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1074155 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [13:16:01] (03PS1) 10Elukey: profile::puppetserver: correctly populate the MASTERS env variable [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) [13:16:02] (03CR) 10Jcrespo: [C:03+1] "Not a blocker, good to go. 
But check the postgres setup of netbox, which is closer to ideal, as we do regular full backups rather than inc" [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:35] (03PS2) 10Filippo Giunchedi: kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) [13:17:54] (03PS2) 10Elukey: profile::puppetserver: correctly populate the MASTERS env variable [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) [13:17:59] (03CR) 10Jcrespo: [C:03+1] "backups of netbox's postgres I meant." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:18:43] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4049/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:19:01] (03PS2) 10Filippo Giunchedi: ganeti: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074123 (https://phabricator.wikimedia.org/T357099) [13:20:13] (03PS1) 10Effie Mouzeli: ipoid: tmp fix for jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074177 [13:20:35] alright, I can deploy now [13:21:07] (03PS3) 10Elukey: profile::puppetserver: correctly populate the MASTERS env variable [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) [13:21:07] seanleong-wmde: are you there? [13:21:11] (03CR) 10Filippo Giunchedi: [C:03+2] kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:21:27] (03CR) 10Filippo Giunchedi: [C:03+2] ganeti: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074123 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:21:31] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] ganeti: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074123 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:21:44] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:22:03] (03PS3) 10Filippo Giunchedi: kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) [13:22:14] (03PS2) 10Effie 
Mouzeli: ipoid: tmp fix for jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074177 [13:22:15] let’s start with Sohom_Datta then [13:22:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:22:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/ProofreadPage] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073879 (https://phabricator.wikimedia.org/T375114) (owner: 10Sohom Datta) [13:22:35] Sounds good :) [13:23:12] I was a bit skeptical about reverting those color changes, but the commit came from the same person who changed the colors in the first place, so I guess that’s okay ^^ [13:23:18] hopefully someone from the design team can look into it eventually [13:23:37] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: tmp fix for jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074177 (owner: 10Effie Mouzeli) [13:23:56] Yeah, we need an expanded color palette to implement the change properly [13:24:45] (03Merged) 10jenkins-bot: ipoid: tmp fix for jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074177 (owner: 10Effie Mouzeli) [13:26:06] (03CR) 10Btullis: "Thanks Jaime. Yes, I'd be keen to keep an eye out for efficiency improvements, once this is in place. This is a brand new type of backup f" [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:26:11] (03CR) 10Btullis: [C:03+2] Configure a bacula fileset and job for db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074156 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [13:27:25] (03PS1) 10Slyngshede: UI for account blocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [13:27:41] (03CR) 10Btullis: [C:03+1] "Nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [13:27:42] (03PS4) 10Filippo Giunchedi: kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) [13:27:56] Lucas_WMDE Hi yes, I'm here [13:28:01] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:28:10] (03CR) 10Muehlenhoff: [C:03+1] "Good catch! Didn't think about this when adapting the Puppet classes wrt the new puppet_merge_server variable" [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:28:36] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [13:28:54] seanleong-wmde: hi! I’ll deploy your patch once Sohom_Datta’s backport is done then [13:29:01] Okay, thank you! [13:30:08] (03CR) 10Hashar: P:idp More precise base_dn for user lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) (owner: 10Slyngshede) [13:31:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:32:17] (03CR) 10Elukey: [C:03+2] profile::puppetserver: correctly populate the MASTERS env variable [puppet] - 10https://gerrit.wikimedia.org/r/1074176 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:33:08] btullis: o/ shall I merge yours? [13:34:07] (03CR) 10Abijeet Patro: [C:04-1] "Planning to review how we do this. 
See: T375190" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342) (owner: 10Wangombe) [13:34:38] (03CR) 10Ssingh: "Thanks for updating wmflib volans! Is this good to go from your end now?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [13:35:14] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [13:35:43] (03CR) 10Lucas Werkmeister (WMDE): Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (owner: 10Seanleong-wmde) [13:35:44] (03PS9) 10CDobbins: sre.dns.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [13:35:53] btullis: seemed harmless, merged :) [13:36:01] (03CR) 10Volans: [C:03+1] "No objections, no blockers from my end. 
I'll leave the fine details and testing to you and your team :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [13:36:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:39:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 22.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:41:00] (03CR) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (owner: 10Seanleong-wmde) [13:41:16] (03PS3) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) [13:42:38] Lucas_WMDE Hi, I added the bug ticket number in the commit message! [13:42:57] (03CR) 10Lucas Werkmeister (WMDE): Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [13:42:59] !log update pfw codfw syslog target - T374658 [13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:03] thank, I just added another comment there [13:43:06] *thanks [13:43:37] (03PS4) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) [13:44:14] (03CR) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [13:44:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:44:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [13:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 10.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:45:24] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1054918/4050/" [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [13:45:28] (03Abandoned) 10Gergő Tisza: SUL3: Use mobile domain in SSO URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069678 (https://phabricator.wikimedia.org/T371596) (owner: 10Gergő Tisza) [13:45:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:46:00] (03CR) 10Elukey: [C:03+1] "LGTM! Are you going to do a manual cleanup afterwards, or probably not needed?" [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:47:17] That's a lot of tests for a 10 line CSS fix :( [13:47:36] yeah… [13:47:51] (03CR) 10Muehlenhoff: "I would manually clean out the Apache site, so that we can verify there's no hidden use of the cert remaining. 
The rest can vanish when we" [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:50:10] yay, it’s almost done [13:50:18] (03Merged) 10jenkins-bot: Bring back quality colors before dark mode fixes [extensions/ProofreadPage] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073879 (https://phabricator.wikimedia.org/T375114) (owner: 10Sohom Datta) [13:50:36] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1073879|Bring back quality colors before dark mode fixes (T375114)]] [13:50:40] T375114: Recent color changes have negatively impacted page proofreading on en.ws - https://phabricator.wikimedia.org/T375114 [13:51:44] (03CR) 10Filippo Giunchedi: [C:03+2] kerberos: remove absented checks [puppet] - 10https://gerrit.wikimedia.org/r/1074124 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [13:53:01] damn, I didn’t realize how late it already is [13:53:02] jouncebot: next [13:53:02] In 1 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1500) [13:53:06] ok, we’ve got some more time [13:53:10] (03CR) 10Elukey: [C:03+1] "Nono I was wondering what's best, your cleanup is fine, +1!" [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:55:43] hm, “1 K8s nodes failed to pull the multiversion image” [13:55:46] 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, 07Wikimedia-production-error: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911#10160870 (10hnowlan) These all appear to be requests from jobrunner hosts, which leads me to assume they're from the ThumbnailRender job. C... 
[13:55:49] (namely wikikube-worker1001) [13:56:57] !log lucaswerkmeister-wmde@deploy1003 soda, lucaswerkmeister-wmde: Backport for [[gerrit:1073879|Bring back quality colors before dark mode fixes (T375114)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:57:00] Sohom_Datta: please test the ProofreadPage fix :) [13:57:03] T375114: Recent color changes have negatively impacted page proofreading on en.ws - https://phabricator.wikimedia.org/T375114 [13:57:47] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [13:57:48] !log elukey@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f1-eqiad [13:58:15] Works :) [13:58:31] !log lucaswerkmeister-wmde@deploy1003 soda, lucaswerkmeister-wmde: Continuing with sync [13:58:33] ok \o/ [13:58:53] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [13:58:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f1-eqiad [13:59:01] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1054918"' [13:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:24] (03PS1) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [13:59:49] Thank you for deploying :) [14:00:37] (03PS2) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [14:00:37] np :) [14:00:46] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:01:38] (03PS3) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [14:02:11] (03CR) 10JMeybohm: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:02:50] (03PS5) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) [14:03:16] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073879|Bring back quality colors before dark mode fixes (T375114)]] (duration: 12m 39s) [14:03:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, we'll probably also need a brief CLI later to easily retrieve the status from the JSON log." [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 (owner: 10Slyngshede) [14:03:20] T375114: Recent color changes have negatively impacted page proofreading on en.ws - https://phabricator.wikimedia.org/T375114 [14:03:32] hm, scap exited with non-zero exit status [14:03:35] * Lucas_WMDE scrolls up [14:03:57] I don’t see any other errors or warnings so I guess that’s just due to the one node that failed to pull [14:04:02] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10160898 (10fnegri) This requires a change to the wiki replicas view definition... 
[14:04:06] but if the actual rollout went fine then I think that can be ignored [14:04:18] it just had to pull the image during the deployment then, rather than in advance 🤷 [14:04:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [14:04:49] (03CR) 10CI reject: [V:04-1] ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:05:27] (03Merged) 10jenkins-bot: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [14:05:38] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1073428|Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
(T66315)]] [14:05:42] T66315: Move "Data item" link into In Other Projects section of sidebar - https://phabricator.wikimedia.org/T66315 [14:06:23] I’ve not followed this task at all, so I’ll be interested to see what happens to https://uk.wikipedia.org/wiki/%D0%92%D1%96%D0%BA%D1%96%D0%BF%D0%B5%D0%B4%D1%96%D1%8F:%D0%90%D0%B2%D1%82%D0%BE%D1%80%D1%81%D1%8C%D0%BA%D1%96_%D0%BF%D1%80%D0%B0%D0%B2%D0%B0 in a moment [14:06:41] (example page with both a connected Wikidata item and a sitelinked Wikidata page) [14:07:03] (03PS4) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [14:07:26] Lucas_WMDE Ah, the wikidata item link should be shifted to the in other projects section after the change [14:07:58] (03PS1) 10Brouberol: airflow: move datahub-related config to global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074186 [14:09:03] (03PS2) 10Brouberol: airflow: move datahub-related config to global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074186 [14:09:55] hm, docker_pull_k8s is taking some time again [14:09:58] wonder if it’s the same node [14:10:15] (03CR) 10CI reject: [V:04-1] ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:10:19] (03PS5) 10JMeybohm: ferm: Use ferm-status to start ferm on diffs [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) [14:10:38] yup, wikikube-worker1001 again [14:10:50] I think that’s worth a task [14:10:55] (03CR) 10Btullis: [C:03+1] airflow: move datahub-related config to global values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074186 (owner: 10Brouberol) [14:11:08] Oh, may I know what does it mean? 
haha [14:11:46] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074185 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:11:50] !log lucaswerkmeister-wmde@deploy1003 seanleong-wmde, lucaswerkmeister-wmde: Backport for [[gerrit:1073428|Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:55] T66315: Move "Data item" link into In Other Projects section of sidebar - https://phabricator.wikimedia.org/T66315 [14:12:00] (03CR) 10Brouberol: airflow: move datahub-related config to global values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074186 (owner: 10Brouberol) [14:12:02] (03CR) 10Brouberol: [C:03+2] airflow: move datahub-related config to global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074186 (owner: 10Brouberol) [14:12:20] seanleong-wmde: can you test the change on mwdebug? [14:12:30] the wikikube-worker1001 issue is probably nothing for you to worry about ^^ [14:12:40] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10160970 (10ayounsi) It would indeed be great to have redundancy for the `fmsw`, but as that device is not managed, there is a risk of creating... 
[14:13:00] Lucas_WMDE Okay!, I'll test it now [14:13:21] (03PS1) 10KartikMistry: Update MinT to 2024-09-19-120927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074187 [14:13:22] 10ops-eqiad, 06DC-Ops: wikikube-worker1001 failed to docker pull on two consecutive deployments - https://phabricator.wikimedia.org/T375201 (10Lucas_Werkmeister_WMDE) 03NEW [14:13:44] 10ops-eqiad, 06DC-Ops: wikikube-worker1001 failed to docker pull on two consecutive deployments - https://phabricator.wikimedia.org/T375201#10160983 (10Lucas_Werkmeister_WMDE) I have frankly no idea what tags to add to this… #ops-eqiad is just a wild guess. But given that it happened twice, it feels worth inve... [14:13:47] task ^ [14:14:10] I am restarting the CI Jenkins [14:14:55] seanleong-wmde: looks like the other projects section now has both “Wikidata” and “Wikidata item”? is that right? [14:15:11] I’m not convinced that’s less confusing, but I’m not the PM ;) [14:15:12] It is working on this page https://uk.wikipedia.org/wiki/%D0%97%D0%B5%D0%BC%D0%BB%D1%8F, but not on the page https://uk.wikipedia.org/wiki/%D0%92%D1%96%D0%BA%D1%96%D0%BF%D0%B5%D0%B4%D1%96%D1%8F:%D0%90%D0%B2%D1%82%D0%BE%D1%80%D1%81%D1%8C%D0%BA%D1%96_%D0%BF%D1%80%D0%B0%D0%B2%D0%B0 that you sent. [14:15:15] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10160987 (10Dreamy_Jazz) >>! In T371486#10160897, @fnegri wrote: > @Ladsgroup c... [14:15:17] (03CR) 10Brouberol: [C:03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [14:15:32] seanleong-wmde: it was working on the second page for me, after force-reloading it [14:15:45] (03CR) 10Ssingh: "Confirmed the safety checks with @bblack@wikimedia.org that they make sense to him as well. Merging." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [14:15:51] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [14:16:29] unless what I’m seeing is not what it’s supposed to do, I guess :D [14:16:48] Sorry, do you mind teaching me how to force-reload? [14:16:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:16:54] '=D [14:17:01] (03PS7) 10Brouberol: airflow: allow the scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [14:17:03] Ctrl+F5 ^^ [14:17:20] (03PS6) 10Ssingh: purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:17:24] my next attempt after that would’ve been ?action=purge in the URL but apparently it wasn’t necessary this time [14:18:01] and then, under “ В інших проєктах ” (“in other projects”, I assume), I see both “Вікідані” (link to Wikidata:Copyright on Wikidata) and “Елемент Вікіданих” (link to the Wikidata item) [14:18:43] Ah I saw it now [14:19:01] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4051/console" [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:19:09] okay, does that mean it’s working correctly? 
[14:19:30] Yup, it's working correctly [14:19:31] (03CR) 10Ssingh: [V:03+1 C:03+2] purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:19:31] (03CR) 10Ssingh: [V:03+2 C:03+2] purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:19:37] !log lucaswerkmeister-wmde@deploy1003 seanleong-wmde, lucaswerkmeister-wmde: Continuing with sync [14:19:42] alright, then let’s deploy it \o/ [14:20:28] the CI Jenkins is back :) [14:20:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool T373579', diff saved to https://phabricator.wikimedia.org/P69332 and previous config saved to /var/cache/conftool/dbconfig/20240919-142046-arnaudb.json [14:20:50] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [14:21:18] Thank you so much for the deployment! Lucas_WMDE [14:21:31] it’s not done yet :P [14:21:32] but np ^^ [14:23:30] (03CR) 10Hashar: "I have verified the exception message and stacktrace are logged :)" [puppet] - 10https://gerrit.wikimedia.org/r/1074112 (owner: 10Hashar) [14:24:09] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073428|Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
(T66315)]] (duration: 18m 31s) [14:24:14] T66315: Move "Data item" link into In Other Projects section of sidebar - https://phabricator.wikimedia.org/T66315 [14:24:28] (03PS2) 10Mforns: Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035) [14:24:35] scap exited nonzero again, I assume it’s the same reason [14:24:50] !log UTC afternoon backport+config window done [14:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:31] Lucas_WMDE :b [14:25:41] (03PS1) 10Hnowlan: mediawiki: remove check_mw_versions [puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860) [14:25:46] !log sudo cumin -b11 "A:cp" 'run-puppet-agent --enable "merging CR 1054918"' [14:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:10] (03PS1) 10Arnaudb: mariadb: remove db2129 [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) [14:27:24] (03CR) 10Arnaudb: "as part of sanity checks (https://phabricator.wikimedia.org/T375186) I've found an instance that needed to be depooled (done) and decommis" [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:27:27] jouncebot: nowandnext [14:27:27] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [14:27:27] In 0 hour(s) and 32 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1500) [14:28:06] Lucas_WMDE: what failed with scap? I can't find the error in logstash [14:28:19] hashar: T375201 [14:28:19] beside sync-world "returned non-zero exit status 1. 
" [14:28:20] T375201: wikikube-worker1001 failed to docker pull on two consecutive deployments - https://phabricator.wikimedia.org/T375201 [14:28:33] oh [14:28:42] apparently it should be fixed now but I don’t really have a third deployment to try it ^^ [14:29:26] I got it now: Sep 19, 2024 @ 14:10:30.516deploy1003 ERR sync-world 1 K8s nodes failed to pull the multiversion image [14:30:11] so yeah that got fixed \o/ [14:30:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.445s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:30:32] (03PS1) 10Dreamy Jazz: Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074192 (https://phabricator.wikimedia.org/T375203) [14:30:41] (03PS1) 10Dreamy Jazz: Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1074193 (https://phabricator.wikimedia.org/T375203) [14:30:55] I want to deploy Lucas, so can test it :D [14:31:00] sure :D [14:31:09] (03CR) 10Dreamy Jazz: [C:03+2] Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074192 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [14:31:12] (03CR) 10Dreamy Jazz: [C:03+2] Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1074193 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [14:31:12] I was wondering if you’d seen the error yesterday and not said anything about it ^^ [14:31:21] but probably the firewall changes weren’t yet in effect then [14:31:27] I didn't see 
it yesterday AFAIK [14:31:29] so the disabled puppet wouldn’t have had a bad effect [14:31:46] (03Abandoned) 10Bking: rdf-streaming-updater: trigger a savepoint before firewall changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [14:31:52] Scap exited normally when I was running it. [14:32:06] (i.e. zero status code) [14:34:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074192 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [14:34:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1074193 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [14:35:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.445s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:37:17] (03PS1) 10Elukey: requestctl: modify comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074194 (https://phabricator.wikimedia.org/T374443) [14:37:27] (03CR) 10Elukey: [V:03+2 C:03+2] requestctl: modify comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074194 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:38:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[14:40:05] (03CR) 10Jcrespo: [C:03+1] "Ok, assuming you are going to run the decom right away, if not consider setting it as insetup if you are going to wait for actual decom." [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:40:22] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10161219 (10elukey) The move was done and everything seems to work as expected! I noticed that when we puppet-merge labs-private... [14:40:50] !log manual replace of mtail binary on centrallog2002 (3.0.0-rc50 to 3.0.8) T375085 [14:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:54] T375085: mtail 3.0.0~rc50-1+b6 leaks memory on centrallog2002 - https://phabricator.wikimedia.org/T375085 [14:41:16] (03CR) 10Ssingh: sre.dns.pdns-recursor: add rolling restart script (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:41:24] (03CR) 10Jcrespo: "if right away, I would remove the hiera key fully. So either full rm or add it as insetup::data_persistence" [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:42:14] (03PS2) 10Arnaudb: mariadb: remove db2129 [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) [14:42:23] (03PS3) 10Arnaudb: mariadb: remove db2129 [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) [14:42:35] (03CR) 10Arnaudb: "done!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:43:00] (03CR) 10Jcrespo: [C:03+1] mariadb: remove db2129 [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:43:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10161254 (10MoritzMuehlenhoff) >>! In T374443#10161219, @elukey wrote: > The move was done and everything seems to work as expect... [14:43:58] (03CR) 10Arnaudb: "will run T375207 right away" [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:43:59] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2129 [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:44:20] (03CR) 10Jcrespo: [C:03+1] "thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1074188 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [14:45:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2129.codfw.wmnet [14:46:24] !log installing expat security updates on Bookworm [14:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:47] (03PS8) 10Andrea Denisse: alert: Ensure Prometheus Alertmanager starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) [14:50:00] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [14:53:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161341 (10jcrespo) ms backups con codfw are stopped. As usual, not asking for priority over my workmates, but if you can not... 
[14:53:08] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2129.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [14:53:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2129.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [14:53:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2129.codfw.wmnet [14:53:34] (03CR) 10Brouberol: [C:03+2] airflow: allow the scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [14:55:11] (03PS1) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [14:55:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:56:13] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:56:16] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4053/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [14:56:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:56:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 depool T373579', 
diff saved to https://phabricator.wikimedia.org/P69333 and previous config saved to /var/cache/conftool/dbconfig/20240919-145626-arnaudb.json [14:56:30] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [14:56:44] fixed the dbctl thingy, sorry for the noise [14:57:07] (03PS2) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [14:58:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [14:58:40] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [14:59:20] (03CR) 10JHathaway: [C:03+1] No longer include config-master on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:04] jnuche and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1500). 
[15:00:07] (03PS6) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [15:01:11] (03Merged) 10jenkins-bot: Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074192 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [15:01:13] (03Merged) 10jenkins-bot: Add CheckUserQueryInterface to autoload classes [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1074193 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [15:01:14] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:01:16] (03CR) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [15:01:37] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1074192|Add CheckUserQueryInterface to autoload classes (T375203)]], [[gerrit:1074193|Add CheckUserQueryInterface to autoload classes (T375203)]] [15:01:41] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [15:01:58] (03CR) 10Andrea Denisse: alert: Ensure Prometheus Alertmanager starts at boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [15:02:02] (03PS3) 10Mforns: Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035) [15:02:42] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [15:02:43] !log elukey@cumin1002 END (FAIL) - 
Cookbook sre.network.tls (exit_code=99) for network device lsw1-f1-eqiad [15:04:02] (03PS3) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [15:04:18] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1074192|Add CheckUserQueryInterface to autoload classes (T375203)]], [[gerrit:1074193|Add CheckUserQueryInterface to autoload classes (T375203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:04:21] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [15:05:11] (03CR) 10Santiago Faci: [C:03+2] Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [15:07:14] (03Merged) 10jenkins-bot: Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [15:08:55] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074192|Add CheckUserQueryInterface to autoload classes (T375203)]], [[gerrit:1074193|Add CheckUserQueryInterface to autoload classes (T375203)]] (duration: 07m 18s) [15:08:59] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [15:09:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=(cp2041|cp2042).codfw.wmnet [reason: depool for T373105] [15:09:49] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [15:11:49] (03CR) 10Dzahn: "this broke puppet in our cloud VPS test instances since suddenly we have to set a new parameter. 
" Function lookup() did not find a value" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:13:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161445 (10ssingh) Traffic hosts (cp2041/cp2042) are depooled. [15:16:27] (03PS1) 10Dzahn: envoy: set downstream and upstream idle timeout hiera keys for cloud [puppet] - 10https://gerrit.wikimedia.org/r/1074198 (https://phabricator.wikimedia.org/T373517) [15:16:59] (03CR) 10Dzahn: "please also set new Hiera defaults in cloud: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074198" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:17:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10161456 (10lmata) [15:18:02] (03CR) 10Hnowlan: [C:03+1] envoy: set downstream and upstream idle timeout hiera keys for cloud [puppet] - 10https://gerrit.wikimedia.org/r/1074198 (https://phabricator.wikimedia.org/T373517) (owner: 10Dzahn) [15:18:31] (03PS1) 10Ryan Kemper: wdqs: allow 3 new federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074199 (https://phabricator.wikimedia.org/T364233) [15:19:24] (03CR) 10Dzahn: [C:03+2] "thanks for the quick review" [puppet] - 10https://gerrit.wikimedia.org/r/1074198 (https://phabricator.wikimedia.org/T373517) (owner: 10Dzahn) [15:20:44] (03PS1) 10JHathaway: vrts_aliases: query database for valid addresses [puppet] - 10https://gerrit.wikimedia.org/r/1074200 (https://phabricator.wikimedia.org/T374090) [15:22:13] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: testing cookbook for actual change, no task ID specified] [15:22:25] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin 
(exit_code=99) DNS admin: depool site ulsfo [reason: testing cookbook for actual change, no task ID specified] [15:24:33] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [15:26:26] Didn't see any problems when deploying, so that problem appears to be fixed. [15:26:37] Although the backport didn't actually fix the problem I wanted to fix. [15:26:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-eqiad [15:27:08] (03CR) 10Volans: [C:03+1] "LGTM, one nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [15:27:08] (03CR) 10Brouberol: [C:03+1] cirrus-streaming-update: enable calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [15:27:25] (03CR) 10Brouberol: [C:03+1] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [15:29:24] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:29:39] (03PS1) 10Dreamy Jazz: Call require_once on CheckUserQueryInterface in population script [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074203 (https://phabricator.wikimedia.org/T375203) [15:31:51] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2052.codfw.wmnet [15:32:24] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2052.codfw.wmnet [15:32:34] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2053.codfw.wmnet [15:33:10] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2053.codfw.wmnet 
[15:33:21] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2282.codfw.wmnet [15:34:49] (03CR) 10Dreamy Jazz: [C:03+2] Call require_once on CheckUserQueryInterface in population script [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074203 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [15:35:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074203 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [15:36:20] jouncebot: nowandnext [15:36:20] For the next 0 hour(s) and 23 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1500) [15:36:20] In 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1600) [15:36:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2282.codfw.wmnet [15:36:42] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2018.codfw.wmnet [15:37:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2018.codfw.wmnet [15:37:28] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2019.codfw.wmnet [15:38:14] (03PS1) 10Ssingh: sre.dns.admin: fix call to set_and_verify [cookbooks] - 10https://gerrit.wikimedia.org/r/1074205 [15:38:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2131 db2152 db2173 db2174 db2181 db2182 db2195 db2219 db2220 es2040 - T373105', diff saved to https://phabricator.wikimedia.org/P69336 and previous config saved to /var/cache/conftool/dbconfig/20240919-153815-arnaudb.json [15:38:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: network maintenance T373105 
[15:38:19] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [15:38:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 10 hosts with reason: network maintenance T373105 [15:39:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161570 (10ABran-WMF) all data-persistence hosts have been depooled and downtimed [15:39:29] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:40:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2019.codfw.wmnet [15:40:49] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2020.codfw.wmnet [15:40:57] (03PS10) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [15:41:25] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2020.codfw.wmnet [15:42:21] (03CR) 10Ssingh: "Expanding this for internal recursors as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [15:43:06] (03PS4) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [15:44:08] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4055/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [15:44:40] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [15:44:43] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:45:13] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [15:45:17] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [15:45:34] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [15:45:37] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:46:33] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:47:15] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-c8-eqiad [15:47:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-c8-eqiad [15:49:50] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 
(https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [15:51:06] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-eqiad [15:52:36] (03PS11) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [15:52:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:12] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: fix call to set_and_verify [cookbooks] - 10https://gerrit.wikimedia.org/r/1074205 (owner: 10Ssingh) [15:55:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad [15:55:42] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: testing cookbook for actual change, no task ID specified] [15:55:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: testing cookbook for actual change, no task ID specified] [15:56:38] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-d5-eqiad [15:56:38] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:58:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-d5-eqiad [15:59:55] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqiad [16:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1600). 
[16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:48] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10161647 (10aborrero) the server has been drained, it should be ready to go at any time @VRiley-WMF thanks! [16:01:27] 👋 no gerrit patches but swfrench-wmf and I have some ops work planned, please coordinate before deploying anything [16:01:40] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1074189 (https://phabricator.wikimedia.org/T374860) (owner: 10Hnowlan) [16:03:54] 06SRE, 06Infrastructure-Foundations, 10netops: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216 (10cmooney) 03NEW p:05Triage→03Low [16:04:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad [16:04:35] (03Merged) 10jenkins-bot: Call require_once on CheckUserQueryInterface in population script [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074203 (https://phabricator.wikimedia.org/T375203) (owner: 10Dreamy Jazz) [16:04:49] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1074203|Call require_once on CheckUserQueryInterface in population script (T375203)]] [16:04:53] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [16:05:38] (03PS2) 10Aklapper: Weekly Phabricator data for Tech News: Add Auto-Submitted [puppet] - 10https://gerrit.wikimedia.org/r/1072536 [16:06:18] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [16:06:56] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1074203|Call require_once on 
CheckUserQueryInterface in population script (T375203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:07:01] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:08:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 22 hosts with reason: Move server uplinks in codfw rack D7 [16:09:15] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 22 hosts with reason: Move server uplinks in codfw rack D7 [16:09:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161695 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a040f2d9-1940-4aba-bd29-efa9aeec87fb) set by cmoon... [16:10:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:09] !log migrating server uplinks in codfw rack D7 to new top-of-rack switch T373105 [16:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:13] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:11:37] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074203|Call require_once on CheckUserQueryInterface in population script (T375203)]] (duration: 06m 47s) [16:11:40] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [16:11:57] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, adding David as headsup re: metricsinfra" [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [16:12:33] !log sukhe@cumin1002 START - Cookbook 
sre.dns.admin DNS admin: pool site ulsfo [reason: testing done, no task ID specified] [16:12:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: testing done, no task ID specified] [16:13:24] !log Running `foreachwikiindblist group0.dblist extensions/CheckUser/maintenance/populateCentralCheckUserIndexTables.php` on a tmux session for T375203 [16:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:18] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on 25 hosts with reason: Move server uplinks in codfw rack D8 [16:16:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on 25 hosts with reason: Move server uplinks in codfw rack D8 [16:16:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161716 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9d0dd9cc-ca9d-4736-b81c-6f32f4a0772d) set by cmoon... [16:17:25] (03CR) 10Dzahn: [C:03+2] "puppet runs fixed, all good, thanks:)" [puppet] - 10https://gerrit.wikimedia.org/r/1074198 (https://phabricator.wikimedia.org/T373517) (owner: 10Dzahn) [16:17:32] !log migrating server uplinks in codfw rack D8 to new top-of-rack switch T373105 [16:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:36] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:21:25] (03CR) 10David Caro: [C:03+1] "Thanks! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [16:22:11] (03CR) 10Andrea Denisse: [C:03+2] alert: Ensure Prometheus Alertmanager starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [16:22:19] (03PS1) 10Kosta Harlan: DiscussionParser: Do not create User objects from subpages [extensions/Echo] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074209 (https://phabricator.wikimedia.org/T375212) [16:22:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161819 (10cmooney) All hosts have been moved and all now responding to ping again. [16:23:15] (03CR) 10Dzahn: [V:03+1 C:03+2] "oh yea, this is just like we already have it in other scripts, so totally uncontroversial :)" [puppet] - 10https://gerrit.wikimedia.org/r/1072536 (owner: 10Aklapper) [16:23:44] (03PS3) 10Joal: EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) [16:25:06] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2129.codfw.wmnet - https://phabricator.wikimedia.org/T375207#10161822 (10ABran-WMF) p:05High→03Medium a:05ABran-WMF→03None [16:25:07] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218 (10phaultfinder) 03NEW [16:25:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69337 and previous config saved to /var/cache/conftool/dbconfig/20240919-162521-arnaudb.json [16:25:26] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:25:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2152 
(re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69338 and previous config saved to /var/cache/conftool/dbconfig/20240919-162526-arnaudb.json [16:25:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69339 and previous config saved to /var/cache/conftool/dbconfig/20240919-162531-arnaudb.json [16:25:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69340 and previous config saved to /var/cache/conftool/dbconfig/20240919-162536-arnaudb.json [16:25:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69341 and previous config saved to /var/cache/conftool/dbconfig/20240919-162541-arnaudb.json [16:25:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69342 and previous config saved to /var/cache/conftool/dbconfig/20240919-162546-arnaudb.json [16:25:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69343 and previous config saved to /var/cache/conftool/dbconfig/20240919-162551-arnaudb.json [16:25:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69344 and previous config saved to /var/cache/conftool/dbconfig/20240919-162556-arnaudb.json [16:26:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69345 and previous config saved to /var/cache/conftool/dbconfig/20240919-162601-arnaudb.json [16:26:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 25%: T373105', diff saved to https://phabricator.wikimedia.org/P69346 and previous config saved 
to /var/cache/conftool/dbconfig/20240919-162606-arnaudb.json [16:26:28] (03CR) 10Cathal Mooney: [C:03+2] Remove IDs for ESI LAGs on codfw spines to row c/d legacy switches [homer/public] - 10https://gerrit.wikimedia.org/r/1074133 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [16:26:44] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2041.codfw.wmnet [16:26:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2041.codfw.wmnet [16:26:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2042.codfw.wmnet [16:26:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2042.codfw.wmnet [16:27:04] (03Merged) 10jenkins-bot: Remove IDs for ESI LAGs on codfw spines to row c/d legacy switches [homer/public] - 10https://gerrit.wikimedia.org/r/1074133 (https://phabricator.wikimedia.org/T364095) (owner: 10Cathal Mooney) [16:27:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=(cp2041|cp2042).codfw.wmnet [reason: T373105 is done] [16:27:49] !log restart swift-object-replicator on thanos-be2002 [16:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:08] (03CR) 10Joal: EventStreamConfig: Disable regex steam hadoop ingestion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [16:28:44] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10161860 (10Dzahn) Thanks! Can we just click "revoke" then? Just so nobody wonders again in the future. 
[16:28:45] (03PS12) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [16:28:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161862 (10ABran-WMF) d/p instances are repooling [16:31:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161877 (10MatthewVernon) ms-nodes all good; thanos-be2004 seems OK (but checking that picked up an unrelated replication issu... [16:33:57] !log testing purged 0.24 in cp2037 - T334078 [16:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:10] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [16:34:19] !log restarting confd on all servers on: codfw, ulsfo, eqsin - T373105 [16:34:29] (03CR) 10Volans: [C:04-1] "The refactor goes into the right direction but there are still quite a few things to adjust/fix/improve IMHO." [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [16:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:31] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:36:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10161886 (10andrea.denisse) Hi Jclark-ctr, do you know what arguments I need to give mdadm to resync the drive? 
[16:36:59] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2052.codfw.wmnet [16:37:01] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2052.codfw.wmnet [16:37:09] (03PS13) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [16:37:17] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2053.codfw.wmnet [16:37:19] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2053.codfw.wmnet [16:37:34] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2282.codfw.wmnet [16:37:37] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2282.codfw.wmnet [16:37:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161888 (10jcrespo) Resumed ms backups on codfw. 
[16:37:52] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2018.codfw.wmnet [16:37:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2018.codfw.wmnet [16:38:10] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2019.codfw.wmnet [16:38:12] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2019.codfw.wmnet [16:38:18] !log disable LAG interface from asw-d-codfw to ssw1-dX-codfw [16:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:28] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2020.codfw.wmnet [16:38:29] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10161893 (10Seddon) 05In progress→03Resolved [16:38:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2020.codfw.wmnet [16:38:59] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10161894 (10Seddon) looks like its all working! Will follow up if needed. 
[16:40:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69347 and previous config saved to /var/cache/conftool/dbconfig/20240919-164026-arnaudb.json [16:40:31] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:40:31] 16:01:27 👋 no gerrit patches but swfrench-wmf and I have some ops work planned, please coordinate before deploying anything <-- this is starting shortly, we're claiming the conch :) [16:40:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69348 and previous config saved to /var/cache/conftool/dbconfig/20240919-164031-arnaudb.json [16:40:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69349 and previous config saved to /var/cache/conftool/dbconfig/20240919-164036-arnaudb.json [16:40:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69350 and previous config saved to /var/cache/conftool/dbconfig/20240919-164041-arnaudb.json [16:40:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69351 and previous config saved to /var/cache/conftool/dbconfig/20240919-164046-arnaudb.json [16:40:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69352 and previous config saved to /var/cache/conftool/dbconfig/20240919-164051-arnaudb.json [16:40:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69353 and previous config saved to /var/cache/conftool/dbconfig/20240919-164056-arnaudb.json [16:41:02] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'db2219 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69354 and previous config saved to /var/cache/conftool/dbconfig/20240919-164101-arnaudb.json [16:41:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69355 and previous config saved to /var/cache/conftool/dbconfig/20240919-164106-arnaudb.json [16:41:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 50%: T373105', diff saved to https://phabricator.wikimedia.org/P69356 and previous config saved to /var/cache/conftool/dbconfig/20240919-164111-arnaudb.json [16:42:14] hi folks, as r.zl just noted, please reach out to coordinate any deployment needs that arise. also, we expect to need the entire mediawiki infra window starting at 17:00 UTC. [16:43:28] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [16:43:47] (03PS14) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [16:44:15] !log uploaded purged 0.24 to apt.wm.o (bullseye-wikimedia) [16:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:44] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [16:45:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10161927 (10phaultfinder) [16:47:33] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:47:53] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:47:57] (03PS15) 10CDobbins: 
sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [16:48:08] !log updated to purged 0.24 in codfw, ulsfo and eqsin - T334078 [16:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:11] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [16:48:33] !log scaling up mw-api-ext in eqiad for pre-switchover testing - T371273 [16:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:40] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [16:50:18] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:50:35] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:50:49] !log scaling up mw-web in eqiad for pre-switchover testing - T371273 [16:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:07] !log ulsfo was depooled between 15:55 and 16:12 for sre.dns.admin test, current state is pooled [16:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69357 and previous config saved to /var/cache/conftool/dbconfig/20240919-165531-arnaudb.json [16:55:36] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [16:55:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69358 and previous config saved to /var/cache/conftool/dbconfig/20240919-165537-arnaudb.json [16:55:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: T373105', diff saved to 
https://phabricator.wikimedia.org/P69359 and previous config saved to /var/cache/conftool/dbconfig/20240919-165542-arnaudb.json [16:55:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69360 and previous config saved to /var/cache/conftool/dbconfig/20240919-165546-arnaudb.json [16:55:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69361 and previous config saved to /var/cache/conftool/dbconfig/20240919-165551-arnaudb.json [16:55:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69362 and previous config saved to /var/cache/conftool/dbconfig/20240919-165556-arnaudb.json [16:56:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69363 and previous config saved to /var/cache/conftool/dbconfig/20240919-165602-arnaudb.json [16:56:05] (03PS2) 10MusikAnimal: Remove $wgCodeMirrorRTL temporary feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) [16:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69364 and previous config saved to /var/cache/conftool/dbconfig/20240919-165606-arnaudb.json [16:56:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69365 and previous config saved to /var/cache/conftool/dbconfig/20240919-165611-arnaudb.json [16:56:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 75%: T373105', diff saved to https://phabricator.wikimedia.org/P69366 and previous config saved to /var/cache/conftool/dbconfig/20240919-165617-arnaudb.json [16:56:26] FIRING: 
RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:59] (03PS1) 10Btullis: Enable the S3 to local backups on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074214 (https://phabricator.wikimedia.org/T372908) [16:57:45] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4056/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074214 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [16:58:32] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the S3 to local backups on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074214 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [17:00:04] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1700). [17:00:04] swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC late) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1700). [17:00:26] here, and starting momentarily o/ [17:00:31] 🚀 [17:02:31] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-int-ro,name=codfw [reason: Pre-switchover capacity validation - T371273] [17:02:35] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10162039 (10Dzahn) Great! 
Thanks @Seddon :) [17:02:35] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [17:03:16] (03PS1) 10Mforns: Fix executable name in commons-impact-metrics service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074217 (https://phabricator.wikimedia.org/T368035) [17:04:59] (03PS2) 10Mforns: Fix executable name in commons-impact-metrics service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074217 (https://phabricator.wikimedia.org/T368035) [17:08:50] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=codfw [reason: Pre-switchover capacity validation - T371273] [17:08:53] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [17:09:32] swfrench-wmf: hi, I would like to backport a production fix [17:09:41] can you let me know when you're done? [17:10:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69367 and previous config saved to /var/cache/conftool/dbconfig/20240919-171037-arnaudb.json [17:10:39] jnuche: thanks for the heads-up, I'll let you know when we're done, but it's likely this will take the full infra window (i.e., ETA 18:00 UTC) [17:10:42] T373105: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 [17:10:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69368 and previous config saved to /var/cache/conftool/dbconfig/20240919-171042-arnaudb.json [17:10:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69369 and previous config saved to /var/cache/conftool/dbconfig/20240919-171047-arnaudb.json [17:10:52] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69370 and previous config saved to /var/cache/conftool/dbconfig/20240919-171052-arnaudb.json [17:10:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69371 and previous config saved to /var/cache/conftool/dbconfig/20240919-171057-arnaudb.json [17:11:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69372 and previous config saved to /var/cache/conftool/dbconfig/20240919-171101-arnaudb.json [17:11:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69373 and previous config saved to /var/cache/conftool/dbconfig/20240919-171107-arnaudb.json [17:11:12] swfrench-wmf: ack, thanks, I'll wait then [17:11:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69374 and previous config saved to /var/cache/conftool/dbconfig/20240919-171112-arnaudb.json [17:11:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69375 and previous config saved to /var/cache/conftool/dbconfig/20240919-171117-arnaudb.json [17:11:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 100%: T373105', diff saved to https://phabricator.wikimedia.org/P69376 and previous config saved to /var/cache/conftool/dbconfig/20240919-171122-arnaudb.json [17:11:34] (03CR) 10BCornwall: [C:03+1] wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:12:02] (03PS1) 10Btullis: Add missing colon to rclone upstream config for db1208 [puppet] - 
10https://gerrit.wikimedia.org/r/1074220 (https://phabricator.wikimedia.org/T372908) [17:12:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4057/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074220 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [17:12:51] (03CR) 10BCornwall: [C:03+2] varnish: Remove carriers netmap [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [17:13:18] (03CR) 10Btullis: [V:03+1 C:03+2] Add missing colon to rclone upstream config for db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074220 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [17:15:01] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4058/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [17:16:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10162104 (10phaultfinder) [17:17:16] (03CR) 10Santiago Faci: [C:03+2] Fix executable name in commons-impact-metrics service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074217 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [17:17:34] !log swfrench@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=codfw [reason: Pre-switchover capacity validation - T371273] [17:17:38] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [17:18:31] (03Merged) 10jenkins-bot: Fix executable name in commons-impact-metrics service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074217 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [17:21:47] !log 
mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [17:21:55] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [17:26:59] !log Finished running script for T375203 on `group0` [17:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:03] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [17:35:36] (03PS1) 10Scott French: Revert "mw-(api-ext|web): scale up to 75% at p95 targets" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074232 (https://phabricator.wikimedia.org/T371273) [17:36:10] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=cp5024.eqsin.wmnet [17:40:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10162199 (10phaultfinder) [17:41:09] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10162201 (10VRiley-WMF) 05Open→03Resolved This DIMM (B2) has been swapped out. Please let us know if any other issue crops up. [17:42:25] (03CR) 10RLazarus: [C:03+1] "LGTM - as discussed we may want to do a version of this permanently, but it doesn't have to be right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074232 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [17:43:01] !log dancy@deploy1003 Installing scap version "4.104.0" for 211 hosts [17:45:54] alright, things look good. I'm going to start walking back the test. [17:46:19] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273] [17:46:23] (03CR) 10Hnowlan: [C:03+1] "Thanks for handling it, sorry for the noise!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1074198 (https://phabricator.wikimedia.org/T373517) (owner: 10Dzahn) [17:46:25] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [17:47:10] !log dancy@deploy1003 Installation of scap version "4.104.0" completed for 211 hosts [17:49:30] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273] [17:51:15] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet [17:52:08] (03PS2) 10Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) [17:52:25] (03CR) 10Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [17:53:50] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273] [17:53:54] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [17:55:52] 06SRE, 06Infrastructure-Foundations, 10netops: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10162292 (10cmooney) Just a note on this task to say that I was able to perform some throughput tests on the old asw-d-codfw devices (QFX5100) which have t... 
[17:56:12] (03CR) 10Scott French: [C:03+2] Revert "mw-(api-ext|web): scale up to 75% at p95 targets" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074232 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [17:56:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10162293 (10cmooney) 05Open→03Resolved a:03cmooney All done with this. Big thanks for @Jhancock.wm for the amazing work m... [17:57:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#10162302 (10cmooney) >>! In T360789#9941103, @Papaul wrote: > All the cabling is done. I am leaving this task open so when we move the console cables from a... [17:57:14] (03Merged) 10jenkins-bot: Revert "mw-(api-ext|web): scale up to 75% at p95 targets" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074232 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [17:57:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10162299 (10cmooney) 05Open→03Resolved a:03cmooney [17:58:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162308 (10cmooney) @Jhancock.wm thanks for doing this. I have completed my testing now on the old switch (thankfully all went well). So thi... [18:00:04] jnuche and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1800). 
nyaa~ [18:00:37] jnuche: dduvall: if you could hold for 5m or so while we verify something, that would be greatly appreciated [18:01:04] swfrench-wmf: sure thing, I'll be around [18:01:28] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on vrts1001.eqiad.wmnet with reason: Migration [18:01:42] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on vrts1001.eqiad.wmnet with reason: Migration [18:01:45] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:01:54] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:02:13] !log scaling down mw-web in eqiad after pre-switchover testing - T371273 [18:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:17] T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic - https://phabricator.wikimedia.org/T371273 [18:02:31] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:02:39] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:02:51] !log scaling down mw-api-ext in eqiad after pre-switchover testing - T371273 [18:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:58] (03CR) 10AOkoth: [C:03+2] vrts: change primary host [puppet] - 10https://gerrit.wikimedia.org/r/1073283 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [18:04:57] jnuche: dduvall: alright, I think you're good to go. thanks for your patience! [18:05:19] thanks swfrench-wmf! 
[18:05:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/Echo] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074209 (https://phabricator.wikimedia.org/T375212) (owner: 10Kosta Harlan) [18:11:39] (03PS1) 10JHathaway: WIP - puppet8: migrate "easy" puppet facts to structured facts [puppet] - 10https://gerrit.wikimedia.org/r/1074239 [18:12:26] (03PS2) 10JHathaway: WIP - puppet8: migrate "easy" puppet facts to structured facts [puppet] - 10https://gerrit.wikimedia.org/r/1074239 [18:12:37] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074239 (owner: 10JHathaway) [18:16:39] (03CR) 10CI reject: [V:04-1] WIP - puppet8: migrate "easy" puppet facts to structured facts [puppet] - 10https://gerrit.wikimedia.org/r/1074239 (owner: 10JHathaway) [18:17:07] (03PS3) 10AOkoth: wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) [18:19:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162355 (10cmooney) Actually just checking it's still at status "planned" in Netbox. And looking at puppetboard it seems it never got added p... 
[18:20:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:21:27] (03CR) 10Dzahn: [C:03+2] gerrit::proxy: fix link target for gerrit logo [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:22:54] (03CR) 10AOkoth: [C:03+2] wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [18:23:47] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove entries for sretest2002 - cmooney@cumin1002" [18:23:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove entries for sretest2002 - cmooney@cumin1002" [18:23:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:07] (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Profile::Gerrit::Proxy/File[/var/www/wikimedia-codereview-logo.cache.png]/ensure: ensure changed 'file' to 'link'" [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:31:22] (03Merged) 10jenkins-bot: DiscussionParser: Do not create User objects from subpages [extensions/Echo] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074209 (https://phabricator.wikimedia.org/T375212) (owner: 10Kosta Harlan) [18:31:36] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1074209|DiscussionParser: Do not create User objects from subpages (T375212)]] [18:31:41] T375212: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375212 [18:33:29] !log jnuche@deploy1003 jnuche, kharlan: Backport for [[gerrit:1074209|DiscussionParser: Do not create User objects from subpages (T375212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:33:36] !log 
jnuche@deploy1003 jnuche, kharlan: Continuing with sync [18:33:57] (03PS1) 10Kosta Harlan: Add ::caller to queries in populateCentralCheckUserIndexTables.php [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074241 (https://phabricator.wikimedia.org/T375221) [18:34:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074241 (https://phabricator.wikimedia.org/T375221) (owner: 10Kosta Harlan) [18:34:20] (03CR) 10Ssingh: "Looking good, I think we are almost there:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:35:38] (03PS1) 10AOkoth: vrts: add required module for 6.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/1074242 (https://phabricator.wikimedia.org/T373420) [18:38:15] !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074209|DiscussionParser: Do not create User objects from subpages (T375212)]] (duration: 06m 38s) [18:38:19] T375212: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375212 [18:40:28] (03CR) 10Dzahn: [C:03+1] vrts: add required module for 6.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/1074242 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [18:40:48] (03CR) 10AOkoth: [C:03+2] vrts: add required module for 6.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/1074242 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [18:43:45] (03PS3) 10Hashar: contint: remove jdk-11 packages [puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:44:07] (03CR) 10Dzahn: "I assume you wanna wait a couple days?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [18:58:23] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162431 (10RobH) I think they'll want a dump of the dmesg directly for the CPU temperature incidents so we can point at where it had to throttle down at exact dates/time, since now they are saying... [19:04:42] (03PS5) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:07:12] (03PS6) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:16:17] (03PS7) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:19:35] (03PS1) 10Bking: WIP: clean up Elastic runbook links post-doc rewrite [alerts] - 10https://gerrit.wikimedia.org/r/1074247 (https://phabricator.wikimedia.org/T356806) [19:20:51] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162474 (10ssingh) Hi @RobH: Sharing the `cumin` command the output so that you have some timestamps (UTC) ready to go (`esams` one is at the end but I am just dumping all for later use): ` sukhe... [19:26:39] (03PS1) 10Mforns: hieradata::services_proxy::envoy.yaml: fix duplicated port [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) [19:38:08] (03CR) 10Scott French: "Thanks for the review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [19:38:10] (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [19:40:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10162534 (10phaultfinder) [19:42:03] (03PS6) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:42:05] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (0370 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:42:29] 70 comments in one patch, let's go [19:43:37] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:46:02] (03CR) 10Dzahn: "today would be a good day for this, not Friday yet, on-call, and out next week. 
no pressure, lol" [puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [19:47:47] jouncebot: nowandnext [19:47:47] For the next 0 hour(s) and 12 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T1800) [19:47:47] In 0 hour(s) and 12 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T2000) [19:51:58] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [19:55:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10162554 (10phaultfinder) [19:57:57] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1073305/4062/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [19:58:39] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162557 (10RobH) I've sent over the log output for the two esam hosts to their respective support email threads, lets see what they say! Thank you! [19:59:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871 (owner: 10C. Scott Ananian) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240919T2000) [20:00:04] toyofuku, Sohom_Datta, kostajh, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] \o [20:00:13] i'm here! [20:00:16] hiiii [20:00:19] I am here on behalf of Kosta [20:00:25] gonna deploy my patch first if that's okay with you all [20:00:44] You're first on the window, so that's fine with me :D [20:00:53] (03CR) 10Dreamy Jazz: [C:03+2] Add ::caller to queries in populateCentralCheckUserIndexTables.php [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074241 (https://phabricator.wikimedia.org/T375221) (owner: 10Kosta Harlan) [20:01:00] (03PS1) 10C. Scott Ananian: Add a "duplicate-ids" lint category [extensions/Linter] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074253 (https://phabricator.wikimedia.org/T200517) [20:01:09] I'm going to start gate-and-submit-wmf for my change, as it will take around 20 mins to complete. [20:01:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Linter] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074253 (https://phabricator.wikimedia.org/T200517) (owner: 10C. Scott Ananian) [20:02:43] thanks all [20:02:50] starting now [20:03:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:04:00] (03Merged) 10jenkins-bot: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:04:11] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1073297|Deploy new donate link location to pilot wikis (take 2) (T373585)]] [20:04:15] cscott: Did you want to schedule your second patch in this window? It's scheduled for next week. 
[20:04:16] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:04:30] Oh I see you've moved it now [20:05:49] Sohom_Datta: Are you around for your change? [20:06:02] !log toyofuku@deploy1003 toyofuku: Backport for [[gerrit:1073297|Deploy new donate link location to pilot wikis (take 2) (T373585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:24] testing now! [20:08:10] amazing, proceeding [20:08:12] !log toyofuku@deploy1003 toyofuku: Continuing with sync [20:08:18] Dreamy_Jazz: I did! [20:08:56] I think i moved it on wiki, but I got distracted by the wikitech SUL migration banner and was doing my chores related to that. [20:09:05] (03CR) 10Dzahn: [V:03+1 C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:09:10] "do i want to be able to edit wikitech between Oct 1 and Nov 30", yeah probably! [20:09:14] :D [20:09:20] Definitely would be useful [20:09:46] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop on prod servers" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:10:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:46] Sohom_Datta's patch seems uncontroversial, so I'd be happy to +2 it even if they are not around. [20:11:13] Dreamy_Jazz: fwiw both of my patches should have no effect on wiki, they just put dependencies in place that ensure we'll be able to test the next parsoid release (parsoid train v{N} is tested against mediawiki-core {N-1} (aka the current production version) prior to our releasing it) [20:11:37] Sure. 
The second one has an i18n change, so could take a while to deploy [20:11:42] although I can briefly test Linter and Parsoid to make sure nothing has exploded [20:11:44] (03CR) 10Umherirrender: Configure ContactPage and IPBE contact form on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [20:12:47] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073297|Deploy new donate link location to pilot wikis (take 2) (T373585)]] (duration: 08m 35s) [20:12:51] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:13:00] All done, thank you everyone! [20:13:22] cscott: Do you want to start your changes in gate-and-submit-wmf now? [20:13:43] The one in core could take a bit of time to merge. [20:13:56] yeah, might as well if it saves some time. [20:14:17] If you +2 the core one first, then the Linter won't go in until that's been merged [20:14:24] Then they can be backported together [20:14:35] Unless you wanted to have all the remaining patches deployed at once [20:15:08] i don't know anything about the other patches and their riskiness (or not) so i think i'll just batch my two together if that's alright [20:15:32] Sure, that's fine by me [20:15:59] (03CR) 10C. Scott Ananian: [C:03+2] Re-order arguments to DataAccess::addTrackingCategory [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871 (owner: 10C. Scott Ananian) [20:16:23] (03CR) 10C. Scott Ananian: [C:03+2] Add a "duplicate-ids" lint category [extensions/Linter] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074253 (https://phabricator.wikimedia.org/T200517) (owner: 10C. Scott Ananian) [20:18:52] On second thoughts, I'm not sure I'm comfortable deploying the config change by Sohom_Datta without them being around as there is no associated ticket. [20:19:07] Unless they appear, I'm going to skip that change. 
[20:19:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "tested on new machine. if both apache2 is removed and /var/www/ doesn't exist, one puppet run fixes both" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:20:17] Dreamy_Jazz: I can ask the reviewer of that patch for some context [20:21:04] Which reviewer? The config change doesn't seem to have been +1'd [20:21:12] :O [20:21:17] I meant Jon who's on my team [20:21:21] It seems to depend on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/1072542 which rode this week's train and would have taken effect on wikisource yesterday i think. [20:21:23] Oh I see. [20:21:45] I don't know much about this patch but it concerns my team [20:22:10] Ebrahim wrote the patch referenced in the config change, but i think they are not WMF staff? [20:22:46] Sure. If someone is able to +1, that would be good. Just because I can see how the patch makes sense, but I don't know if the associated patch is the only one left to do. [20:23:16] I reached out to Jon but he might be busy rn [20:23:40] pls hold off on deploying [20:23:48] Jon's gonna reply in the patch [20:23:55] Sorry to have held you up! [20:24:02] No problem. [20:24:15] My change is still 8 mins away from merging [20:24:37] and mine are 22 min (!) [20:24:56] Once it's merged, I will need to deploy within 10 or so mins so that my change can go in separately to the other changes. [20:25:11] sob [20:25:15] (03CR) 10Jdlrobson: "Will take a look before Monday, but I suggest holding off until next week, in case anything goes wrong."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [20:25:20] well thanks for keeping me company while my change went out [20:25:23] see ya [20:25:25] :D [20:25:28] Bye [20:25:49] Going to mark that second config change as not done per the comment [20:26:53] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@e0d8d78]: Fix canary events generation schedule [airflow-dags/analytics@e0d8d78a] [20:27:02] (03Merged) 10jenkins-bot: Add ::caller to queries in populateCentralCheckUserIndexTables.php [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074241 (https://phabricator.wikimedia.org/T375221) (owner: 10Kosta Harlan) [20:27:36] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@e0d8d78]: Fix canary events generation schedule [airflow-dags/analytics@e0d8d78a] (duration: 00m 42s) [20:27:39] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1074241|Add ::caller to queries in populateCentralCheckUserIndexTables.php (T375221)]] [20:27:43] T375221: Lots of "SQL query did not specify the caller" warnings - https://phabricator.wikimedia.org/T375221 [20:29:27] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Backport for [[gerrit:1074241|Add ::caller to queries in populateCentralCheckUserIndexTables.php (T375221)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:31] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Continuing with sync [20:29:40] Once PHPUnit tests are parallelised, it should be quicker to backport patches on repos that are part of the gate. 
[20:30:07] (03PS1) 10Ebernhardson: ClosedWikiProvider: Support canAlwaysAutocreate option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) [20:34:29] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074241|Add ::caller to queries in populateCentralCheckUserIndexTables.php (T375221)]] (duration: 06m 50s) [20:34:34] T375221: Lots of "SQL query did not specify the caller" warnings - https://phabricator.wikimedia.org/T375221 [20:34:44] cscott: All over to you. [20:35:14] (03PS2) 10Ebernhardson: ClosedWikiProvider: Support canAlwaysAutocreate option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) [20:36:12] !log Running `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/populateCentralCheckUserIndexTables.php` on a tmux session for T375203 [20:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:19] T375203: Run populateCentralCheckUserIndexTables.php on WMF wikis - https://phabricator.wikimedia.org/T375203 [20:38:08] can you run the scap? i have deployer rights but try not to actually use them :) [20:38:34] looks like CI is still running anyway [20:38:38] eta 8 min [20:39:26] Sure. I can do that. [20:40:24] i ran maintenance scripts for i think the first time ever earlier this week and although everything went smoothly it was very nerve-wracking! [20:40:57] Yeah. Whenever I interact with production I always get a slightly faster heart rate [20:41:04] my deployer rights date back to when parsoid was written in js and so we'd do node deploys of parsoid every week; i haven't really done a deploy since we put parsoid/php on the MW train a few years ago. [20:42:22] * cscott watches phpunit slowly creep past 81% [20:42:41] It is somewhat like watching paint dry [20:43:22] Ah, I missed the window?
[20:43:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871 (owner: 10C. Scott Ananian) [20:43:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/Linter] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074253 (https://phabricator.wikimedia.org/T200517) (owner: 10C. Scott Ananian) [20:44:10] Hi Sohom_Datta, the window is still going. [20:44:21] However there was a comment on your patch suggesting it should be left till next week [20:44:29] So I marked it as not done [20:45:08] Sure, I see, that works :) [20:45:35] (03PS2) 10Scott French: wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) [20:45:36] (03CR) 10Scott French: "Thank you all in advance for the review. Jaime, let me know if there are any additional reviewers you'd like me to add from data-persisten" [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [20:45:44] (03Merged) 10jenkins-bot: Re-order arguments to DataAccess::addTrackingCategory [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871 (owner: 10C. Scott Ananian) [20:45:46] (03Merged) 10jenkins-bot: Add a "duplicate-ids" lint category [extensions/Linter] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074253 (https://phabricator.wikimedia.org/T200517) (owner: 10C. 
Scott Ananian) [20:45:58] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073871|Re-order arguments to DataAccess::addTrackingCategory]], [[gerrit:1074253|Add a "duplicate-ids" lint category (T200517)]] [20:46:02] T200517: Emit lint error or category when a page uses duplicate HTML IDs - https://phabricator.wikimedia.org/T200517 [20:46:14] ok, we're merged, whee [20:46:29] cscott: I'm understanding that there is nothing to test with these changes then? [20:46:38] i have some minor tests queued up [20:46:43] :+1L [20:46:47] 👍 [20:47:02] https://en.wikipedia.org/wiki/Special:LintErrors/duplicate-ids should look like https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors/duplicate-ids after the sync [20:47:37] and https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&text=Hello%2C%20''World''&parsoid=1&formatversion=2 should "still not crash" -- that's just a really rough smoke test, but still worth doing to make sure parsoid hasn't exploded [20:48:50] (03PS2) 10Scott French: wmnet: update CNAME record for maintenance host to codfw [dns] - 10https://gerrit.wikimedia.org/r/1073898 (https://phabricator.wikimedia.org/T370962) [20:48:50] (03CR) 10Scott French: "Thank you both in advance for the review. This is DNS change 2 of 4 for next week's switchover." [dns] - 10https://gerrit.wikimedia.org/r/1073898 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [20:49:43] (03PS2) 10Scott French: geo-maps: update map default to list codfw first [dns] - 10https://gerrit.wikimedia.org/r/1073899 (https://phabricator.wikimedia.org/T370962) [20:49:53] It may take a while to get to the test stage, as one of the changes included an i18n change. 
[20:50:18] fair enough [20:50:20] (03PS2) 10Scott French: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) [20:50:20] (03CR) 10Scott French: "Alright, this is the last DNS change for next week (4 of 4). Thanks again." [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [20:54:13] (03PS1) 10Scott French: debug.json: order codfw (primary) DC backends first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073895 (https://phabricator.wikimedia.org/T370962) [20:54:57] (03PS1) 10Scott French: hieradata: update deployment_server to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/1073894 (https://phabricator.wikimedia.org/T370962) [20:54:57] (03CR) 10Scott French: "And thank you once again for review of this one as well." [puppet] - 10https://gerrit.wikimedia.org/r/1073894 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [20:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:59:37] (03PS1) 10CDobbins: sre.dns.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1074266 (https://phabricator.wikimedia.org/T375232) [20:59:45] subbu: the linter and core patch are syncing, but because the linter patch added i18n messages it is taking a while. 
[21:00:53] !log dreamyjazz@deploy1003 dreamyjazz, cscott: Backport for [[gerrit:1073871|Re-order arguments to DataAccess::addTrackingCategory]], [[gerrit:1074253|Add a "duplicate-ids" lint category (T200517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:00:57] T200517: Emit lint error or category when a page uses duplicate HTML IDs - https://phabricator.wikimedia.org/T200517 [21:01:00] Please test. Thanks. [21:01:03] ok! [21:01:54] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:02:00] Dreamy_Jazz: everything looks good [21:02:06] !log dreamyjazz@deploy1003 dreamyjazz, cscott: Continuing with sync [21:05:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10162717 (10phaultfinder) [21:06:45] Scap is being very slow [21:06:52] (03PS16) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [21:06:55] We are not yet deployed even to the canaries [21:09:27] Deployed to the canaries, will probably be 20 mins+ till it's complete [21:09:35] ok [21:09:57] Actually the regular deployment is going much faster [21:10:13] I think it's done in much larger batches [21:10:53] Maybe 5 mins :) [21:13:20] (03CR) 10CI reject: [V:04-1] sre.dns.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1074266 (https://phabricator.wikimedia.org/T375232) (owner: 10CDobbins) [21:15:19] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073871|Re-order arguments to DataAccess::addTrackingCategory]], [[gerrit:1074253|Add a "duplicate-ids" lint category (T200517)]] (duration: 29m 20s) [21:15:23] T200517: Emit lint error or category when a page uses duplicate HTML IDs - https://phabricator.wikimedia.org/T200517 [21:15:28] Done! 
[21:15:32] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:15:37] !log Evening UTC backport window done [21:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:20] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:17:18] Dreamy_Jazz: thanks so much! [21:17:24] Np [21:20:05] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:20:58] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [21:24:28] (03PS7) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:24:30] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:28:26] (03CR) 10Pppery: [C:03+1] "I suspect if I squinted at this list longer I could come up with some more ideas but this looks good enough to me now." [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:36:21] (03CR) 10Dzahn: "thanks a lot for the suggestions and reviews. Yea, at some point it's better to break it down into smaller patches, for sure." 
[puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [21:39:43] (CR) Dzahn: "so the diff between PS1 and latest PS of this should now be the "human filter" that subtracts the "bad domains" from the "ok/other mark mo" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [21:43:50] (CR) Dzahn: [C:+1] gerrit: fix todo from 2022, remove nist key setting [puppet] - https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: Dzahn) [21:52:22] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10162880 (Dwisehaupt) [21:52:39] (PS4) Dzahn: contint: remove jdk-11 packages [puppet] - https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) [21:53:20] (CR) Dzahn: [C:+1] "whenever Antoine agrees" [puppet] - https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn) [21:57:31] (PS9) Dzahn: zuul: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [21:58:52] (PS10) Dzahn: zuul: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [22:00:39] (PS1) Dzahn: acme_chief: authorize new machine gerrit2003 to fetch gerrit certs [puppet] - https://gerrit.wikimedia.org/r/1074275 (https://phabricator.wikimedia.org/T372804) [22:01:14] (CR) Pppery: [C:+1] "Feel free to split these unrelated comments on existing ncredirs to a separate patch."
[puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [22:05:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:05:52] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:06:18] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:09:26] (CR) Dzahn: "This also needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 but I fixed the other review comment." [puppet] - https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [22:11:11] (CR) Dzahn: "let's wait until the week 09/30.
no rush" [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn) [22:14:20] (CR) Dzahn: [C:+2] acme_chief: authorize new machine gerrit2003 to fetch gerrit certs [puppet] - https://gerrit.wikimedia.org/r/1074275 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:21:46] (PS8) Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [22:21:53] (CR) Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [22:22:26] (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [23:03:20] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:03:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:52] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:08:41] (PS1) Jdlrobson: Drop support for non-Codex message box styles in Vector 2022 and Vector [skins/Vector] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1074282 (https://phabricator.wikimedia.org/T360668) [23:08:50] (PS2) Jdlrobson: Drop support for non-Codex message box styles in Vector 2022 and Vector [skins/Vector] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1074282 (https://phabricator.wikimedia.org/T360668) [23:38:30] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1074287 [23:38:30]
(CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1074287 (owner: TrainBranchBot) [23:54:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:56:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10