[09:24:48] netops, Infrastructure-Foundations, ops-magru, SRE: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10578164 (ayounsi) Open→Resolved a:ayounsi No more errors.
[09:50:09] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10578273 (JMeybohm) >>! In T384731#10566953, @fgiunchedi wrote: >>>! In T384731#10563685, @ayounsi wr...
[10:32:10] I'd like to merge the previously-approved PCS/hewiki change now-ish if that suits https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117508
[10:54:05] ack
[11:02:37] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10578486 (cmooney) Open→Resolved Gonna close this one at this point. All has been ok in eqiad and codfw since the increase in thread count last week - gaps are no...
[11:34:06] Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10578589 (Vgutierrez)
[12:20:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10578641 (cmooney) >>! In T385217#10572967, @cmooney wrote: > DC-Ops folks Nokia recommend trying to interrupt the grub bootlo...
[13:01:48] o/
[13:01:58] started as WIP patch the geo-maps eqiad depool: https://gerrit.wikimedia.org/r/c/operations/dns/+/1122545
[13:02:18] I still have some doubts/question marks, if anybody has time to review/check lemme know
[13:06:59] Traffic, Maps, SRE: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10578780 (Gnoeee) >>! In T383210#10575269, @ssingh wrote: > @Gnoeee: This has been rolled out and should now be live.
Please feel free to re-open this task if there are any issues. Thank yo...
[13:09:14] * vgutierrez looking
[13:11:33] please don't -2 at the first line that you read, it is still wip :D
[13:15:45] I don't -2 you ;P
[13:16:11] elukey: regarding your magru question.. I don't see any place where you're moving traffic from eqiad to magru as the side effect of your change
[13:17:13] it looks like if we need to depool magru some networks that would normally hit eqiad would need to reach codfw instead
[13:20:16] vgutierrez: might you be available to join a call with bblack and the Experiment Platform crew at 1800 your time? i think normally that doesn't work, but figured i'd ask. kwakuofori asked if it would be possible to get you on invitations related to Experimentation Lab. as fate would have it the meetings where we've asked for bblack's attendance are completed, _but_ Sam S and I have had recurring meetings with bblack...
[13:20:52] dr0ptp4kt: sure thing, send an invitation please
[13:21:01] is that today?
[13:21:03] Traffic, MediaWiki-Engineering, serviceops, Upstream, Wikimedia-production-error: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10578816 (jijiki)
[13:21:31] so, i was thinking that probably simplest would be to figure out how we might loop you in. i realize today probably isn't good. for future cases, we do have 1700 meetings on most tuesdays and thursdays - and really to the extent we could have you or Brandon or both at those it might be helpful. but making our calendars align is a bit tricky as we all know. thanks for confirming! i'll add you in there.
[13:21:59] thanks, dr0ptp4kt. perhaps, you can share the recordings and notes with vgutierrez so he can catch up with you guys
[13:23:56] kwakuofori: yes, good idea! vgutierrez you happen to check Slack on occasion? if so, i'll add you to the org-facing `#experiment-platform-team` so you can see updates.
but, additionally, for the stuff so far what you'd want to do is click through on the links from the weekly asana update. i'll email forward that thing to you to spare you having to hunt in the asana ui
[13:24:44] dr0ptp4kt: sounds good, yeah.. I'm not in there (Slack) all the time but I try to look at it from time to time
[13:38:31] vgutierrez: right right ok, now I have a doubt though - when depooling eqiad, we want eqiad traffic to stay within the DC, but we also want to avoid cache PoPs connecting to eqiad as well (like Magru's ATS reaching codfw instead).
[13:38:54] and we won't depool eqiad from discovery records
[13:39:10] yeah, that's right
[13:39:16] that will impact magru experiment
[13:40:14] the bit that I am missing is if this will happen with my patch or not
[13:40:56] hmmm that depends on how the DNS discovery records are configured
[13:41:38] I might need a third coffee
[13:41:56] scratch my last sentence and let me think
[13:42:07] I may need one too, I have the same doubt :D
[13:42:25] because the idea is to avoid any discovery change, so within eqiad we'll be able to hit eqiad services etc..
[13:44:29] elukey: ok.. so active/passive discovery DNS records should still get traffic in eqiad AFAIK
[13:44:44] active/active should see all the traffic diverted to codfw
[13:45:27] (assuming that eqiad is pooled and codfw is depooled for active/passive ones)
[13:45:58] netops, Infrastructure-Foundations, SRE: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10578920 (cmooney) >>! In T387018#10574426, @ayounsi wrote: > Enabling traceoptions shows a `no shared cipher` error on the switch : > ` > Feb 24 09:33:58 ssl_transp...
[13:46:33] elukey: that makes sense?
:)
[13:52:22] vgutierrez: I am missing something about how geo-maps works, or better I may have known in the past but I clearly forgot: say that we are on a magru cp node, and we resolve a discovery record - without discovery changes, that will return (probably) an eqiad LVS IP. From here onward my brain gets confused :D
[13:53:15] elukey: active/active or active/passive one?
[13:53:36] say active/active
[13:53:45] if it's active/active the geo-map entry for the magru DC network should pick the first one
[13:54:18] if active/passive and codfw is the passive one eqiad would still get the traffic
[13:56:46] elukey: so you need to include in your CR the private/mgmt ranges of every DC
[13:57:05] for magru that's 10.192.0.0/12
[13:57:33] sorry, 10.140.0.0/16
[13:57:36] wrong copy&pasta
[13:58:23] vgutierrez: yep I was about to ask that
[13:59:06] so I should read those line configs as - if I am a host in this subnet (like in magru), my preference for dns discovery etc.. resolution is X Y Z
[14:02:14] yes
[14:02:26] same as the others.. but based on source IP rather than geoIP data
[14:02:33] okok now it makes more sense thanks :)
[14:02:36] s/IP/network/
[14:03:08] going to add the changes that we discussed + the one that you added in the review
[14:07:20] elukey: https://gerrit.wikimedia.org/r/c/operations/dns/+/1113205
[14:07:59] I thought this was sufficient for what we were trying to do but reading the above backlog
[14:10:21] sukhe: but we also need to steer traffic from the eqiad PoP as well
[14:12:00] yeah ok then, I will abandon this one and we can focus on the elukey one
[14:18:11] sukhe: ah snap sorry I didn't see your patch :(
[14:19:17] elukey: no worries! it was incomplete anyway. but yeah, make sure to update private addresses to put eqiad at the end in your patch too, like here https://gerrit.wikimedia.org/r/c/operations/dns/+/1122545/3/geo-maps#316
[14:23:15] yep yep!
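Editor's sketch of the "if I am a host in this subnet, my preference is X Y Z" reading confirmed above. This is not gdnsd's implementation or the real geo-maps content. The 10.140.0.0/16 magru range comes from the discussion; the 10.64.0.0/12 entry, the DC orderings, the default, and the names `NET_PREFERENCES`, `preference_for` and `pick_datacenter` are assumptions made for illustration. It also assumes a depooled DC is simply skipped in favour of the next one in the list.

```python
import ipaddress

# Hypothetical map: source network -> datacenters in order of preference.
# Only 10.140.0.0/16 (magru) is taken from the chat; everything else is illustrative.
NET_PREFERENCES = {
    ipaddress.ip_network("10.140.0.0/16"): ["codfw", "eqiad"],  # magru, eqiad moved to the end
    ipaddress.ip_network("10.64.0.0/12"): ["eqiad", "codfw"],   # placeholder for an eqiad-internal range
}
DEFAULT_PREFERENCE = ["codfw", "eqiad"]  # assumed fallback for unmatched sources


def preference_for(source_ip):
    """Return the ordered DC list for the most specific network containing source_ip."""
    addr = ipaddress.ip_address(source_ip)
    matches = [net for net in NET_PREFERENCES if addr in net]
    if not matches:
        return DEFAULT_PREFERENCE
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return NET_PREFERENCES[best]


def pick_datacenter(source_ip, pooled):
    """Return the first DC in the preference list that is currently pooled, else None."""
    for dc in preference_for(source_ip):
        if dc in pooled:
            return dc
    return None
```

Under these assumptions, `pick_datacenter("10.140.1.5", {"eqiad", "codfw"})` resolves a magru host to codfw, while a host in the eqiad placeholder range keeps hitting eqiad as long as eqiad stays pooled.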
[14:23:27] currently waiting for tox to validate locally
[14:24:25] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579051 (fgiunchedi) >>! In T384731#10578273, @JMeybohm wrote: >>>! In T384731#10566953, @fgiunchedi...
[14:44:40] netops, Infrastructure-Foundations: BGP peers with missing descriptions - https://phabricator.wikimedia.org/T387220 (ayounsi) NEW p:Triage→Low
[14:48:43] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579181 (ayounsi) >> And what happens if peer_descr is missing or empty ? > good question, in that c...
[14:50:48] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10579186 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs7003.magru.wmnet with OS bookworm
[14:57:55] o/ me again - I'd like to move ahead with the citoid migration to something that looks vaguely like group0. We're reasonably confident in citoid's behaviour and it's fairly low-risk even with that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122542
[14:58:57] hnowlan: should be OK but note that we have our weekly meeting coming up in a minute
[14:59:18] so if you can do it 16:00 UTC, that will be great. if not, now is also fine and we can take a look in between
[15:00:57] sukhe: absolutely zero rush, I can wait until then. thanks!
[15:13:24] * dr0ptp4kt vgutierrez: i realize i didn't respond directly: but yes, the 1800 meeting i was talking about was today. understood if it doesn't work today.
[15:13:40] dr0ptp4kt: already accepted the convo :) see you later
[15:13:45] thanks!
[15:55:44] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10579603 (Vgutierrez)
[15:57:04] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10579608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs7003.magru.wmnet with OS bookworm completed: - lvs7003 (**PASS**) - Downtimed on...
[16:09:39] sukhe: okay to go ahead nowish?
[16:09:45] hnowlan: yes please
[16:10:27] thanks!
[16:12:03] vgutierrez, sukhe - I had a chat with Joe about the patch and something came up, namely the discovery-map file that we deploy via puppet.. from the gdns' config it seems that geo-maps is related to public services, meanwhile for discovery we need to use something different (discovery-map)
[16:12:08] does it make sense?
[16:12:27] because if so, the proposal would be simpler: depool eqiad via cookbook, update discovery-map via puppet and reload gdnsd
[16:12:52] for example, changing
[16:12:54] 10.2.7.0/24 => [eqiad, codfw], # magru LVS
[16:13:13] to prefer codfw, or maybe only codfw?
[16:13:22] elukey: I tried to clarify that bit in the task https://phabricator.wikimedia.org/T380858#10427637
[16:13:32] but anyway, yes, that works and is definitely much preferred
[16:13:44] if joe is OK with it, it's the easiest path forward
[16:14:18] and yesh, depool eqiad for geodns and then update modules/profile/files/dns/auth/discovery-map
[16:15:46] elukey: that's not the network for magru DC BTW
[16:15:52] elukey: that's the LVS range
[16:16:12] yes sorry wrong example
[16:16:18] I think he was just giving an example but the file is right
[16:17:26] elukey: no reload required as well, puppet will automatically do that for you fwiw
[16:17:53] gdnsd watches admin_state but puppet takes care of reloading gdnsd on discovery-map
[16:17:58] so yeah, very little work in that sense
[16:19:21] sukhe: so overall the change would be to preserve what is set for the eqiad IPs in discovery-map, leaving codfw for the rest. So anything generated within eqiad would hit eqiad svcs (thanks to discovery), but the rest of the DCs would prefer codfw.
[16:19:56] depool eqiad via admin state first, then roll out the puppet change to gdnsd
[16:20:12] If this is workable then I prefer it way more
[16:25:06] elukey: OK by us and yes, the above seems correct. also we can verify everything is working as intended once you do the above two steps
[16:26:26] thanks a lot for the patience, and sorry for the back and forth
[16:26:37] mostly due to my lack of knowledge about these configs
[16:27:01] elukey: you are not alone <3
[16:34:23] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122627/1/modules/profile/files/dns/auth/discovery-map
[16:35:37] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10579756 (cmooney) Myself and Jenn went on a call with Brooke, Saju and some of the other Nokia technical folks. They couldn'...
[16:38:41] netops, Traffic, Infrastructure-Foundations: eqiad/esams/drmrs LVS: use Netbox BGP flag - https://phabricator.wikimedia.org/T380469#10579766 (ayounsi) Open→Resolved All done ! There was no diff, as expected in the best case scenario.
[18:19:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10580081 (cmooney) Ok all devices are back online and reachable via SSH, all running SR Linux v24.7.2. Tomorrow I'll try to f...
[18:25:53] Traffic, MediaWiki-Engineering, serviceops, Patch-For-Review, and 2 others: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10580120 (Scott_French) Thanks again, all. T386006 should now be substantially resolved from the standpoint of upgrading to a PC...
[23:19:20] hey traffic folks I notice the units for the graphs on this page are wrong:
[23:19:22] https://grafana-rw.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1
[23:19:45] The measure is "node_network_receive_bytes_total" but the unit is set to bits/sec
[23:20:07] We should either change the units or (better) add * 8 to the queries to get the bps
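Editor's sketch of the arithmetic behind the "* 8" suggestion above: a byte counter such as `node_network_receive_bytes_total` yields bytes per second once turned into a rate, so the result has to be multiplied by 8 before it can be labelled bits/sec. The function name `rate_bps` and the sample values are hypothetical, and the dashboard's actual queries are not shown in the log. In PromQL the same idea would presumably look like `rate(node_network_receive_bytes_total[5m]) * 8`, where the 5m window is an assumption.

```python
BITS_PER_BYTE = 8


def rate_bps(prev_bytes, curr_bytes, interval_s):
    """Bits per second between two samples of a monotonically increasing byte counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bytes_per_sec = (curr_bytes - prev_bytes) / interval_s
    return bytes_per_sec * BITS_PER_BYTE  # without this factor the value is bytes/sec


# Hypothetical samples: 1,250,000 bytes received over 10 seconds.
print(rate_bps(0, 1_250_000, 10))  # 1000000.0 bits/sec, i.e. 1 Mbps
```

Leaving the query untouched and only switching the panel unit to bytes/sec would also be consistent; the multiplication is the option described as better in the message above.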