[00:21:09] FIRING: LVSHighCPU: The host lvs7003:9100 has at least its CPU 19 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[00:26:09] RESOLVED: LVSHighCPU: The host lvs7003:9100 has at least its CPU 19 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[07:28:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11970653 (ayounsi) `--move-vlan` is only made to migrate core DCs from legacy to new per rack vlans. Let me know if its worth spending...
[08:20:24] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11970802 (MatthewVernon)
[08:50:07] netops, Infrastructure-Foundations, SRE: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430#11971143 (ayounsi) Once this is fixed we can remove `|ibgp` from the [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/1295805 | RejectingBGPPrefixes...
[10:25:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11971500 (cmooney) >>! In T427393#11970653, @ayounsi wrote: > `--move-vlan` is only made to migrate core DCs from legacy to new per rac...
[11:04:21] hi, i'd like some help undoing low-traffic lvs for eventstreams-internal https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410
[11:38:52] ping Traffic
[12:05:25] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11971733 (ops-monitoring-bot) Draining ganeti2027.codfw.wmnet of running VMs
[12:08:34] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11971736 (ops-monitoring-bot) VM kubestagemaster2005.codfw.wmnet switching disk type to drbd
[12:26:18] atsukoito: hello
[12:26:26] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11971797 (ops-monitoring-bot) Draining ganeti2027.codfw.wmnet of running VMs
[12:26:37] fabfur: hi
[12:27:04] i want to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295410, when is the best time to do it?
[12:27:29] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11971798 (ops-monitoring-bot) VM kubestagemaster2005.codfw.wmnet switching disk type to plain
[12:29:14] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11971800 (ops-monitoring-bot) Draining ganeti2027.codfw.wmnet of running VMs
[12:34:28] fabfur: if i understood https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service correctly, I'll need to 1) restart pybal, and 2) do ipvsadm --delete-service on a few servers
[12:35:53] can I proceed myself, or should I wait for someone to be online here?
[12:54:04] if you want to proceed I can help if needed
[12:55:06] fabfur: thanks! I'm going to merge the diff now and proceed with running puppet on O:lvs::balancer
[13:01:06] ok
[13:02:30] puppetmerging (in fact, _joe_ is)
[13:04:21] running puppet on O:lvs::balancer
[13:06:21] <_joe_> atsukoito: btw I see the instructions there still say "wait some time" between pybal restarts; there is a better way to know when you can proceed.
[13:06:48] <_joe_> on the server where you restarted pybal, use curl localhost:9090/metrics | grep pybal_bgp_session_established
[13:06:48] _joe_: I'm all ears!
[13:07:00] thanks
[13:07:07] <_joe_> and check all the connections that were established before the restart are up again
[13:07:35] <_joe_> there's also a cookbook but I'm not sure it still works correctly :/
[13:09:28] * atsukoito acked PyBal IPVS diff check
[13:12:13] the cookbook does work but you have to manually ACK the Icinga check for it to finish. so in that respect, the manual approach is better.
[13:12:30] and of course this is temporary till we roll out Liberica in the cores and life is easier again :)
[13:12:33] ok, I'll proceed with restarting one-by-one
[13:15:37] wait, the manual says to restart on backup server and mentions A:lvs-low-traffic-eqiad and then on primary and mentions A:lvs-low-traffic-eqiad again
[13:15:49] should it be different servers?
[13:16:40] I think it's intended the *primary* of A:lvs-low-traffic-eqiad and the *secondary* of the same subset
[13:16:49] I mean backup and primary sorry
[13:17:30] anyway for low-traffic in eqiad lvs1020 is the backup and lvs1019 is the primary
[13:18:16] thanks!
[13:23:18] fabfur, _joe_: restarted on lvs1020,lvs1019
[13:23:35] is wikipedia still up and running?
[13:23:37] :)
[13:23:38] <_joe_> 👍
[13:23:47] <_joe_> fabfur: for some value of up
[13:24:11] no sticker for atsukoito then
[13:25:04] stream.w.o is on low-traffic and still running
[13:27:00] now the same on codfw?
[13:27:45] lvs2014 first, then lvs2013?
[13:28:54] or ipvsadm first? fabfur sukhe
[13:29:36] atsukoito: doesn't really matter in that respect; the ipvsadm command is also same except the VIP and the port, so in theory you can do it at the very end
[13:29:51] you can paste the command for review here or let us know if you prefer us to come up with it
[13:33:05] `ipvsadm --delete-service --tcp-service 4992`, but I don't know how to find the ip, is it a primary server ip?
[13:34:47] https://www.irccloud.com/pastebin/t3j9O7ez/
[13:34:51] is it this one?
[13:35:05] atsukoito: yes, so that one for eqiad
[13:35:08] and the other one for codfw
[13:35:14] 10.2.1.35
[13:35:16] atsukoito: yes
[13:35:59] fabfur, _joe_: restarted on lvs2014,lvs2013
[13:47:47] `atsuko@lvs1019:~$ ipvsadm --delete-service --tcp-service 10.2.2.35:4992`
[13:48:32] sukhe@cumin1003:~$ dig -x 10.2.2.35 +short
[13:48:32] eventstreams-internal.svc.eqiad.wmnet.
[13:48:34] looks good
[13:49:59] `atsuko@lvs2014:~$ ipvsadm --delete-service --tcp-service 10.2.1.35:4992` (checked `atsuko@lvs2014:~$ dig -x 10.2.1.35 +short`)
[13:50:56] +1
[13:51:00] 09:50:12 <+icinga-wm> RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal
[13:51:06] also a good sign
[13:56:12] all the alerts are cleared
[13:56:17] thanks atsukoito, and nice job!
[13:57:42] when you have a minute, what should I do with https://gerrit.wikimedia.org/r/c/operations/dns/+/1267042 ?
[14:03:17] XioNoX: reading and refreshing my memory
[14:04:33] the probenet data and the email match in that respect, for at least YT right?
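For the PyBal restart check suggested at 13:06 (compare `pybal_bgp_session_established` before and after the restart), a minimal Python sketch of that comparison might look like the following. It assumes the endpoint returns standard Prometheus text format; the `peer` label and the addresses in the sample data are hypothetical, since the log does not show what labels the metric actually carries.

```python
METRIC = "pybal_bgp_session_established"


def established_sessions(metrics_text):
    """Return the set of series (name plus labels) whose value is 1."""
    sessions = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        # "# HELP" / "# TYPE" comment lines do not start with the metric name.
        if not line.startswith(METRIC):
            continue
        if "}" in line:
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        # Skip other metrics that merely share the prefix.
        if not (series == METRIC or series.startswith(METRIC + "{")):
            continue
        fields = rest.split()
        if fields and float(fields[0]) == 1.0:
            sessions.add(series)
    return sessions


def missing_after_restart(before_text, after_text):
    """Sessions that were established before the restart but are not now."""
    return established_sessions(before_text) - established_sessions(after_text)


# Hypothetical sample output, saved before and after a restart.
BEFORE = (
    "# TYPE pybal_bgp_session_established gauge\n"
    'pybal_bgp_session_established{peer="192.0.2.1"} 1\n'
    'pybal_bgp_session_established{peer="192.0.2.2"} 1\n'
)
AFTER = (
    'pybal_bgp_session_established{peer="192.0.2.1"} 1\n'
    'pybal_bgp_session_established{peer="192.0.2.2"} 0\n'
)

still_down = missing_after_restart(BEFORE, AFTER)
```

An empty result from `missing_after_restart` would correspond to "all the connections that were established before the restart are up again"; the two texts could come from saving the `curl localhost:9090/metrics` output before and after restarting pybal.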
[14:05:36] drmrs is still preferred for both, but right now the default is esams for AF
[14:05:55] there is the open question of Reunion Island,
[14:05:59] but we can take that later
[14:07:16] Yeah, data for the Reunion island shows that eqsin is best, so I'd prefer to keep geodns there
[14:08:19] for YT there is no real data, so until we have better we should follow what the people said in their email
[14:08:44] for AF, yeah maybe we should switch it to drmrs, I don't think it makes sense to have esams there
[14:09:28] XioNoX: there is some data for YT though https://gerrit.wikimedia.org/r/c/operations/dns/+/1267042/comments/2028dd85_aa962ee1
[14:09:35] that also points to drmrs, as short the sample size is
[14:09:48] I +1ed the patch if it helps
[14:10:20] sukhe: afaik it points to eqsin, no ? at least for mean/median
[14:10:38] but yeah sample size of 7/9 I think it's like no data
[14:11:40] YT per the comment points to drmrs for mean and median (which well is the same in this case)
[14:11:48] |Africa |Mayotte|YT |drmrs|244.0 |244.0 |222.0 |266.0 |62.23 |3872.0 |2
[14:11:51] this one basically
[14:12:36] but it's lower for eqsin, no?
[14:12:50] 206.57 |161.0
[14:13:02] XioNoX: that's RE?
[14:13:09] are we talking about RE or YT?
[14:13:13] er, looking at the wrong comment, yeah
[14:13:32] haha, there is just no data for eqsin :(
[14:13:35] :)
[14:13:42] and 2 samples for the others, we can't use that
[14:13:47] so YT is fine I think since you have the email confirmatoin too
[14:13:50] *confirmation
[14:13:58] yeah, sounds good to me
[14:14:35] +1ed, feel free to merge or ask me to
[14:16:59] thanks sukhe, fabfur, and _joe_ for support during my first lvs operation
[14:17:07] yw :)
[14:20:04] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: bird bfd session with 172.20.1.1 down - Bad packet from 172.20.1.1 - unknown session id - https://phabricator.wikimedia.org/T427202#11972440 (ayounsi) p:Triage→Low
[14:20:10] <_joe_> atsukoito: congrats, and you didn't even win a t-shirt
[14:20:17] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283#11972441 (ayounsi) p:Triage→Low
[14:20:33] <_joe_> atsukoito: https://commons.wikimedia.org/wiki/File:Framed_%22I_BROKE_WIKIPEDIA..._THEN_I_FIXED_IT!%22_T-shirt.jpg this one in case you're wondering
[14:20:41] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11972454 (ayounsi) p:Triage→Medium
[14:21:54] sukhe: meetings then I'm off so I'll do it tomorrow
[14:22:27] XioNoX: I will take care of it since it has been sitting for a while.
[14:22:36] ok, thanks!
[14:38:45] netops, SRE, Traffic-Icebox: experiment with reenabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288#11972619 (LSobanski) Untagging IF.
[14:51:02] Puppet is disabled on durum5003 since May 26 with a note about a reimage? that reimage doesn't seem to have happened, I would re-enable it in the meantime?
[14:51:20] moritzm: ouch yeah, that's me
[14:51:25] I couldn't get it to reimage
[14:51:27] I should try again
[14:51:50] fine either way, I can re-enable Puppet now or leave it as-is for a new attempt?
[14:51:57] yeah fine to leave it, I will start a new reimage
[14:52:01] (done)
[14:52:11] ok! let me know if there are still issues, happy to have a look
[14:52:16] yeah thanks, I might ask for help
[15:14:15] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11972844 (sbassett)
[16:01:44] sukhe, moritzm the re-image issue is because of the routed ganeti, and the current intermediary state until we refresh the switches
[16:02:24] XioNoX: yeah, I think that's it.
[16:02:42] I have to head for lunch for now but the reimage failed, I can provide better logs and open a task shortly
[16:02:50] in short need to disable the DHCP relay daemon on the core routers for the VM re-image to work. As otherwise the core routers see the dhcp requests from the ganeti nodes as rogue and block it
[16:03:30] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973104 (ops-monitoring-bot) Draining ganeti2045.codfw.wmnet of running VMs
[16:05:38] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973112 (ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to drbd
[16:15:24] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973137 (MoritzMuehlenhoff)
[16:51:17] Traffic, DC-Ops, ops-codfw, SRE: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11973277 (BCornwall) Open→Resolved That's a fair point, and considering we're on nvme drives power loss is less of a concern as well since it's non-volatile....
[17:00:40] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973321 (ops-monitoring-bot) VM dse-k8s-etcd2001.codfw.wmnet switching disk type to drbd
[17:16:20] Traffic, MediaWiki-File-management: Wikimedia Commons: incorrect 429 responses for thumbnail errors - https://phabricator.wikimedia.org/T419663#11973365 (neriah)
[17:34:28] Traffic, MediaWiki-File-management: Wikimedia Commons: incorrect 429 responses for thumbnail errors - https://phabricator.wikimedia.org/T419663#11973487 (neriah) Open→Resolved p:High→Low
[17:44:28] Traffic, SRE: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836 (ssingh) NEW
[17:44:30] Traffic, SRE: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11973510 (ssingh) p:Triage→Medium
[17:59:30] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973526 (ops-monitoring-bot) Draining ganeti2045.codfw.wmnet of running VMs
[18:01:38] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973540 (ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to plain
[18:03:05] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973544 (ops-monitoring-bot) VM dse-k8s-etcd2001.codfw.wmnet switching disk type to plain
[18:05:54] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11973560 (ops-monitoring-bot) Draining ganeti2045.codfw.wmnet of running VMs
[21:48:22] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11974306 (sbassett) Revisiting this over the past month, it looks like we're receiving, on average, ~ 2500 report-on...
[23:45:33] Is there a utility somewhere that I can use to test an IP against the /etc/haproxy/ipblocks.d/all.map list the way that `map_ip()` does?
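On the `map_ip()` question, one rough way to approximate the lookup offline is a small Python sketch using the standard `ipaddress` module. This is not HAProxy's own code. It assumes the map file uses the usual HAProxy map layout (one `key value` pair per line, `#` comments), that keys are single addresses or CIDR networks, and that the most specific matching network wins, which is how HAProxy's IP matching behaves as far as I know. The entries and values in `SAMPLE` are made up for illustration; the real contents of `all.map` are not shown in the log.

```python
import ipaddress


def parse_map(text):
    """Parse HAProxy-style map text into (network, value) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        # A bare address becomes a /32 (or /128) network.
        entries.append((ipaddress.ip_network(parts[0], strict=False), value))
    return entries


def lookup(entries, ip, default=None):
    """Return the value of the most specific network containing ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, value in entries:
        if net.version != addr.version or addr not in net:
            continue
        if best is None or net.prefixlen > best[0].prefixlen:
            best = (net, value)
    return best[1] if best else default


# Hypothetical map contents, for illustration only.
SAMPLE = """
# network        value
10.0.0.0/8       broad
10.1.2.0/24      narrow
2001:db8::/32    v6
"""

ENTRIES = parse_map(SAMPLE)
result = lookup(ENTRIES, "10.1.2.3")
```

To try it against the real file, `parse_map(open("/etc/haproxy/ipblocks.d/all.map").read())` would replace `SAMPLE`. If the HAProxy admin socket is reachable, the Runtime API's `get map` command is probably closer to the real behaviour, since it asks HAProxy itself to do the lookup.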