[00:04:25] RESOLVED: SystemdUnitFailed: haproxy_stek_job.service on cp2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:19] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Better Benthos performances - https://phabricator.wikimedia.org/T360454#9887794 (Fabfur) Update on Benthos performances. To be able to compare Benthos (now RedPanda) to some tools we already use, I've collected some data from cp...
[10:43:27] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9888132 (gmodena) The haproxy / benthos feed is now available in raw form under `wmf_staging.webrequest_frontend_rc0` and p...
[12:28:13] netops, Traffic, Infrastructure-Foundations, SRE, SRE-swift-storage: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9888375 (MatthewVernon) Just to note that per [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&va...
[12:50:22] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408 (cmooney) NEW p:Triage→Low
[13:01:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9888532 (cmooney) Open→Resolved
[13:42:09] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9888712 (cmooney) We could use these cables but the host side but we might not have enough slack to connect to servers at dif...
[13:47:45] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9888730 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94b81d4d-316b-4c68-b4a9-a2d07057d180) se...
[14:21:29] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9888856 (Clement_Goubert)
[14:45:00] Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Remove long term caching and active purging for Parsoid endpoints in RESTBase - https://phabricator.wikimedia.org/T365630#9889005 (daniel)
[14:49:13] Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Remove long term caching and active purging for Parsoid endpoints in RESTBase - https://phabricator.wikimedia.org/T365630#9889044 (daniel)
[15:03:36] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889146 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=891c00a3-b649-4659-b39f-5ad6b01367a9) se...
[15:04:47] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889149 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5a6a58c5-4681-4aea-8e80-e8ba2c613022) se...
[15:22:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889279 (cmooney) Switch has reloaded on the new version, all looks good at first glance. ` cmooney@lsw1-f6-eqiad...
[15:36:16] Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Remove long term caching and active purging for Parsoid endpoints in RESTBase - https://phabricator.wikimedia.org/T365630#9889390 (FJoseph-WMF) Open→In progress
[15:37:23] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889405 (MatthewVernon) Swift looks good, thanks.
[15:40:40] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889411 (Jdforrester-WMF) Looks like this is now done except for "some straggling traffic" for the api-gateway? {F55289507}
[15:54:26] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889560 (Clement_Goubert) Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.
[15:57:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889580 (cmooney) Open→Resolved Thanks for checking things, all stable on our side I will close the ta...
[15:57:56] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889584 (hnowlan) I believe the straggling traffic here is a misnomer/a graph misunderstanding - the API gateway refers to traffic to the mediawiki API as "mwapi_cluster"...
[16:05:34] netops, Infrastructure-Foundations, SRE: No IPv6 ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 (cmooney) NEW p:Triage→High
[16:30:13] netops, Infrastructure-Foundations, SRE: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9889810 (cmooney)
[17:39:21] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[17:57:34] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890210 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4...
[17:58:03] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[18:17:46] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890275 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4...
[18:36:35] Traffic, DC-Ops, ops-ulsfo, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890351 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[19:51:37] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: -...
[19:53:21] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890588 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[20:16:35] brett: a part of this change you made on the Reimage page on wikitech is Dell-specific and wouldn't work on other platforms, please link the Platform Specific documentation instead: https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle%2FReimage&diff=2192604&oldid=2120335
[20:24:43] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:25:23] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:32:58] netops, Infrastructure-Foundations, SRE: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890729 (cmooney) It seems this was an inadvertent result of the upgrade to the codfw row A/B switches, and the move there from a purely L2 switching layer to a rout...
[20:33:07] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.96:443 @ cp4038 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:33:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:34:43] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:35:23] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:38:07] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.96:443 @ cp4038 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:41:07] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.96:443 @ cp4038 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:44:02] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:44:07] Traffic, DC-Ops, ops-ulsfo: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890788 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**P...
[20:46:07] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.96:443 @ cp4038 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:58:13] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:02:29] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:04:02] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:07:21] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:12:21] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:12:29] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:14:38] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:18:03] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890880 (cmooney) I've pushed this change to cr2-eqdfw and it seems to be doing what we need there: Codfw /48 is announced to Facebook: ` cmoo...
[21:18:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:23:51] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:28:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890909 (cmooney) I'm monitoring the change in traffic levels. Right now it seems negligible, however that is not much surprise, prior to the...
[21:28:51] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:31:38] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:44:19] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890956 (cmooney) Just to note that for the same time period (since March 5th) we've not been announcing the codfw aggregates from eqord: ` cmo...
[21:48:13] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:53:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:01:38] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:03:21] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:13:13] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:13:21] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:18:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:19:44] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:24:38] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:29:44] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:31:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:38:13] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:46:41] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:48:13] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:51:41] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:53:13] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:54:03] Traffic, Data-Platform-SRE: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) - https://phabricator.wikimedia.org/T365616#9891164 (bking) @Vgutierrez Awesome, thank you for the comprehensive plan of action. I'll get to work on the puppet patches starting tomorrow. Once the pat...
[22:54:47] Traffic, Data-Platform-SRE (2024.05.27 - 2024.06.16): Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) - https://phabricator.wikimedia.org/T365616#9891166 (bking) Stalled→In progress a:bking
[22:58:13] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:58:21] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:03:09] FIRING: [8x] LVSHighCPU: The host lvs3008:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[23:08:09] RESOLVED: [8x] LVSHighCPU: The host lvs3008:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[23:23:13] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:31:41] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:38:13] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:43:13] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:43:21] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:51:41] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:53:13] FIRING: [3x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent