[00:08:11] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10123906 (10Papaul) [00:33:52] * brett poofs [01:12:00] FIRING: [2x] PurgedHighEventLag: High event process lag with purged on cp2038:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:24:48] 06Traffic, 06SRE, 13Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10124079 (10Vgutierrez) this has been triggered again in cp2038 and cp2041: ` vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --sin... [04:27:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp2038:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:32:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue [04:32:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp2038:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:47:00] RESOLVED: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:47:30] FIRING: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:47:30] RESOLVED: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:52:30] FIRING: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [04:57:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [05:02:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [05:03:30] FIRING: PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [05:08:30] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag [05:32:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp2041:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue [06:57:32] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124177 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [06:57:48] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [06:57:50] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [06:57:57] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124180 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[07:01:03] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [08:01:30] 10Acme-chief, 06Traffic, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06SRE: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124321 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002... [08:24:41] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [08:32:22] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes202... [08:33:16] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203... [08:37:01] 10Acme-chief, 06Traffic, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06SRE: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124378 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002... 
[08:42:33] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124386 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [08:42:34] 10Acme-chief, 06Traffic, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06SRE: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124387 (10MoritzMuehlenhoff) [08:43:33] 10Acme-chief, 06Traffic, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06SRE: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124388 (10MoritzMuehlenhoff) 05Open→03Resolved All done! [08:43:39] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [08:48:34] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124417 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [08:48:37] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... 
[09:46:12] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [09:57:54] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124728 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... [10:41:06] 06Traffic: Elevated 503 backend fetch failed reported by users - https://phabricator.wikimedia.org/T364691#10124931 (10Ladsgroup) It is still happening, a user reported that for two days, he is constantly getting 503, Here is an example I have: Backened fetch failed. Timestamp: 2024-09-06-06:01:47 UTC. via cp30... [10:59:42] 06Traffic, 06SRE, 13Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10125005 (10Vgutierrez) 05Stalled→03In progress a:03Vgutierrez [11:10:02] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [11:15:13] 06Traffic: Elevated 503 backend fetch failed reported by users - https://phabricator.wikimedia.org/T364691#10125053 (10Vgutierrez) esams seems to be as healthy as usual per https://grafana.wikimedia.org/goto/ix3gNVeSR?orgId=1: {F57468067} ATS metrics match HAProxy data https://grafana.wikimedia.org/goto/8WWHHV6... 
[11:26:58] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125115 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... [11:28:42] 06Traffic: Elevated 503 backend fetch failed reported by users - https://phabricator.wikimedia.org/T364691#10125128 (10Vgutierrez) that specific request triggered a timeout while trying to read the POST request body: ` Sep 6 06:01:47 cp3069 varnish-frontend-fetcherr[1010101]: @cee: {"time": "2024-09-06T06:01:47... [11:34:32] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125185 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2320 to w... [11:35:46] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2... [11:41:01] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2321 to w... [11:47:26] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2322 to w... 
[11:50:21] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125289 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [11:50:27] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:50:30] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125291 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [11:50:45] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:50:54] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2332 to... [11:51:04] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125309 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... 
[11:51:14] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125310 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [11:51:24] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:51:38] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [12:03:24] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2333 to... [12:13:29] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2334 to... [12:15:43] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125390 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... 
[12:15:44] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125391 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... [12:16:42] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125393 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:16:51] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125394 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... [12:17:20] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125395 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:17:59] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125396 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... [12:17:59] 06Traffic, 06MW-Interfaces-Team, 06serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#10125397 (10akosiaris) >>! In T364400#10123233, @BPirkle wrote: >> I was wondering if we have enough consensus to proceed in the path of having /api/ be rewritten in MediaWiki to /w/... 
[12:18:16] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125398 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:18:30] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [12:24:08] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125427 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:24:30] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [12:29:26] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125439 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:29:42] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[12:33:14] 06Traffic: Provide a golang-github-confluentinc-confluent-kafka-go-dev version that matches librdkafka capabilities for bullseye - https://phabricator.wikimedia.org/T374232 (10Vgutierrez) 03NEW [12:43:15] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125567 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2097.codfw.wm... [12:47:49] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125594 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumbering for host kubestage2... [12:48:22] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125596 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... [12:49:38] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet... [12:54:14] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[12:57:05] <_joe_> hi, enterprise is asking if they can use Range headers in downloading images from commons for larger files [12:58:09] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125623 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... [13:01:35] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203... [13:02:46] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125629 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [13:02:50] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [13:06:14] _joe_: short answer, no [13:06:21] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[13:07:35] <_joe_> vgutierrez: I'd use a longer one too :) [13:07:46] _joe_: long answer, we don't support Range requests on the text cluster, but they can use it to fetch the actual blob from upload.wikimedia.org [13:07:56] <_joe_> yes that's what they're doing [13:08:07] <_joe_> they want to do that to control bw usage I guess [13:08:16] that's fine AFAIK [13:08:31] <_joe_> I was wondering if it wasn't more stressful for varnish/ats [13:09:57] varnish gets bypassed for any request where using Range would make sense in terms of size [13:10:31] <_joe_> right [13:10:35] and AFAIK we are not aware of any issues with range requests and ATS [13:10:45] happy to help if they discover something :) [13:10:52] <_joe_> ack [13:11:23] well, if it's cached in varnish, Range will actually work and serve a chunk from cache, I think. But not if it's not. IIRC [13:11:35] but we also only store a limited window of sizes in frontend varnish, too [13:11:51] yeah, that's why I said "varnish gets bypassed for any request where using Range would make sense in terms of size" [13:12:01] yeah [13:12:23] in any case.. well-behaved UAs that try to save resources are welcome :) [13:12:25] 8MB limit? [13:13:02] yes [13:13:27] we could maybe consider tweaking those limits. needs digging to find good values though, but I know those values are "old" [13:13:43] (older access patterns, smaller memory on the cache hosts, etc) [13:25:31] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
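The Range discussion above can be illustrated with a minimal sketch, assuming a client that already knows the size of a large file (for example from a Content-Length header) and wants to fetch it from upload.wikimedia.org in fixed-size pieces. The helper names `byte_ranges` and `range_header` and the chunk size are hypothetical choices for illustration, not anything prescribed in the conversation. The only protocol detail relied on is the standard HTTP `Range: bytes=start-end` header, whose offsets are inclusive.

```python
from typing import Dict, Iterator, Tuple


def byte_ranges(total_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def range_header(start: int, end: int) -> Dict[str, str]:
    """Build the HTTP header for one inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}


if __name__ == "__main__":
    # Hypothetical 20 MiB file fetched in 8 MiB pieces.
    mib = 1024 * 1024
    for start, end in byte_ranges(20 * mib, 8 * mib):
        print(range_header(start, end))
```

A client would send one request per header and expect a 206 Partial Content response for each; a 200 response means the server ignored the Range header and returned the whole object.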
[13:26:25] FIRING: SystemdUnitFailed: user@0.service on cp4050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:18] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [13:30:58] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125698 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:31:08] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2095.codfw.wm... [13:31:18] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125700 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... [13:31:34] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125715 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... 
[13:31:44] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125716 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:32:54] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125733 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:32:57] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125734 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:36:15] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with... [13:36:25] RESOLVED: SystemdUnitFailed: user@0.service on cp4050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:10] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125845 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host kubestage2001.... 
[14:12:29] 06Traffic: Elevated 503 backend fetch failed reported by users - https://phabricator.wikimedia.org/T364691#10125924 (10Ladsgroup) 05Open→03Declined It's client side :( [14:13:14] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2102.codfw.wmne... [14:17:31] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125940 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke... [14:35:48] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247 (10cmooney) 03NEW p:05Triage→03Medium [14:37:29] ^^^ folks just want to make you aware of this one [14:37:54] almost an exact repeat of T374155 which myself and John tackled last night [14:37:54] T374155: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155 [14:38:39] I made the call myself yesterday to disable pybal on lvs1019 and we fixed the problem (bad optic) as nobody else was online [14:39:09] Can I proceed and do the same for this similar issue with lvs1017 if I can get someone from dc-ops to look? [14:39:32] FYI it looks less urgent in this case as there appears to be almost no use on this link (from lvs1017 to servers in rows E-F) [14:41:00] aka stop pybal && disable puppet? [14:41:09] yeah [14:41:20] make sure things fall-over to lvs1020, then take a look at the link [14:41:39] I expect like last night it's an optic gone bad on the host side - maybe a bad batch or they've hit their shelf life [14:41:54] so..
I don't know [14:41:59] lvs1020 config seems to be outdated [14:42:14] let me see what's going on [14:42:25] ok thanks [14:42:53] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126049 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [14:43:02] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [14:43:14] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [14:43:24] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126053 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... [14:44:02] inflatador, akosiaris are you around? [14:44:28] 10netops, 06Infrastructure-Foundations, 06serviceops, 06SRE, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [14:44:29] actually.. 
just akosiaris :) [14:45:09] akosiaris: I see you restarted pybal on lvs1019, but not lvs1020 and we got an outstanding alert for lvs1020 regarding an outdated config [14:45:17] vgutierrez yes [14:45:32] ah, I guess I'm not needed then ;) [14:47:14] vgutierrez: I stopped and started pybal on lvs1019 yesterday too if that's relevant [14:47:27] standing alert is 4h old [14:47:43] ok yeah it was ~16h ago I did it [14:49:27] vgutierrez: wait, double checking because I am sure I did [14:49:46] in fact, I've even re-scheduled icinga checks [14:50:03] so puppet restored a config file [14:50:04] https://puppetboard.wikimedia.org/report/lvs1020.eqiad.wmnet/c79467ff261ef2eeb6f0e760dbaa39c09399a1b7 [14:50:14] you edited pybal config file manually? :) [14:50:38] no, that's expected, gimme a sec [14:51:02] https://gerrit.wikimedia.org/r/q/topic:%22php72_removal%22 [14:51:08] I've removed today php72 [14:51:13] the check that is [14:51:21] oh ok.. the commit date is old but got merged today [14:51:22] gotcha [14:51:25] that's why pybal needed a restart [14:51:55] Active: active (running) since Fri 2024-09-06 14:37:45 UTC; 13min ago [14:52:13] yeah.. alerts.wm.o is kinda slow [14:52:15] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126073 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [14:52:19] it is [14:52:45] I have a systemctl restart pybal line in my history on lvs1020 fwiw [14:52:52] yeah..
it got restarted 13 minutes ago [14:52:54] pretty sure I did restart it ;-) [14:53:14] icinga is complaining as well [14:53:17] sigh [14:53:29] I did re-submit checks for those [14:54:42] ok, that explains everything, thx akosiaris and sorry for the unnecessary ping [14:54:48] topranks: you can go ahead and depool lvs1019 [14:54:53] no worries, makes sense [14:55:06] [why are we doing this on a Friday EU evening BTW?] [14:55:51] vgutierrez: ok thanks [14:56:18] we can wait if you would prefer however [14:56:31] packet loss / error rate on that link is about 50%, so it's basically unusable [14:56:42] however doesn't seem to be much traffic trying to use it so maybe it doesn't matter [14:56:55] up to you, John is on site if we want to take a look now [15:03:09] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126147 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... [15:08:27] vgutierrez: so.... what's your thoughts should I proceed here? [15:08:42] sure, go ahead [15:08:56] I wasn't implying that we should stop now with John on site [15:11:59] no worries - just making sure, like I say the low usage I think makes this - which would be an emergency otherwise - less urgent [15:12:13] still though it's fairly straightforward we'll go ahead [15:15:12] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126210 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c63ff66a-28d3-4567-b7cc-a03c0da01345) set by cmooney@cumin1002 for 2:...
[15:16:48] Traffic, MW-Interfaces-Team, serviceops, Patch-For-Review: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#10126207 (akosiaris) Open→Stalled Change reverted, I 'll switch this to `stalled` until we have enough consensus. Let us know of any updates or if you... [15:19:09] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10126233 (Eevans) >>! In T373097#10121448, @MatthewVernon wrote: > There are 4 swift servers in `C4` - ms-be2058 ms-... [15:24:36] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126243 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2430 to wi... [15:28:34] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10126265 (Eevans) >>! In T373101#10121463, @MatthewVernon wrote: > There are some impact Swift servers: > - ms-be2054 and ms-be2... [15:29:47] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126281 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumber... [15:31:39] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10126290 (Eevans) >>! In T373102#10121495, @MatthewVernon wrote: > These racks have the following Swift/Ceph nodes:...
[15:32:53] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [15:42:43] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126347 (cmooney) Open→Resolved Ok we have replaced the optic in lvs1017 (same model as the one taken from lvs1019 for the record),... [15:49:30] FIRING: [7x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted [15:50:36] ^ me using the rolling reboot cookbook on durum* [15:51:08] yes, I used grace-sleep and only 2 at a time .. [15:54:30] RESOLVED: [6x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted [16:04:33] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126461 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[16:31:15] Traffic, Content-Transform-Team-WIP, Data-Persistence, iOS-app-feature-Performance, and 7 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365#10126553 (Seddon) [18:39:11] netops, Infrastructure-Foundations: asw2-d-eqiad vcp links flapping - https://phabricator.wikimedia.org/T374272 (CDanis) NEW [18:39:20] netops, Infrastructure-Foundations: asw2-d-eqiad vcp links flapping - https://phabricator.wikimedia.org/T374272#10126939 (CDanis) p:Triage→High