[03:48:46] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10252637 (10Papaul) 05Open→03Resolved This is complete
[03:48:53] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10252634 (10Papaul) 05Open→03Resolved This is complete.
[06:59:36] <wikibugs>	 06Traffic, 10Maps, 06SRE: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10252733 (10Ckepper) Awesome, thank you - works like a charm 👍
[09:13:21] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10253170 (10cmooney) >>! In T377381#10252289, @Jclark-ctr wrote: > @cmooney  Step 1: Firewall Installation...
[12:34:29] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10253766 (10aborrero) 05In progress→03Stalled Turns out, to enable PTR creation support, per {T377740} we would need to eit...
[12:50:46] <wikibugs>	 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10253808 (10ayounsi) a:05ayounsi→03Papaul
[12:50:47] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028#10253811 (10ayounsi) 05Open→03Stalled
[12:51:34] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638#10253832 (10ayounsi) 05Open→03Stalled a:05ayounsi→03None
[12:52:04] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10253836 (10ayounsi) a:03Papaul
[13:28:36] <hnowlan>	 o/ hello! I've been investigating (https://phabricator.wikimedia.org/T373517) some fairly tricky issues with timeouts of long-running connections to the shellbox-video service and we've ruled out most of the possible parts of the stack the request passes through 
[13:28:49] <hnowlan>	 er sorry, not timeouts, terminations of connections 
[13:29:22] <hnowlan>	 it seems that when we use the discovery record for the service we see the failure pattern, but if we directly connect to a k8s pod or worker IP address the request works 
[13:29:54] <hnowlan>	 The pattern we see is that the client connects to the service, makes the request (which can be a long wait), but once the service starts returning data to the client the connection is reset 
[13:30:43] <hnowlan>	 We currently make very similar requests against the videoscaler discovery service and have no issues, and the service.yaml configs for both are very similar 
[13:31:11] <hnowlan>	 I don't suppose there's anything historical or custom we do for videoscalers in lvs/ipvsadm that I might be missing? 
[13:33:57] <hnowlan>	 I'm aware this is quite vague and grasping at straws :) 
[13:42:23] <topranks>	 hnowlan: "once the service starts returning data to the client the connection is reset" 
[13:42:44] <topranks>	 does one or other side send a TCP RST packet?  or does this happen at a higher layer/
[13:43:38] <XioNoX>	 a packet capture (ideally on both sides) would be useful here
[13:44:06] <hnowlan>	 I'll get a capture of one side, getting the k8s side has proven tricky 
[13:47:17] <bblack>	 might be interesting to see if enable tcp keepalives on the client side has any effect?
[13:47:44] <bblack>	 if it's through lvs, I could see lvs losing the mapping after a lot of time passes with no packets in either direction.
[13:48:07] <bblack>	 the response packets go direct to the client host, but when it tried to ack them it would fall over because lvs forgot the connection for that direction to work.
[13:49:50] <bblack>	 (in general though, having http server endpoints that are known to pause for-ev-er between req and resp is a bad pattern to avoid.  better to have some async way of handling things (respond immediately with some reqid# or whatever, have client poll for completion of the req every once in a while, or something similar)
[13:50:29] <bblack>	 )
[13:51:01] <hnowlan>	 this is absolutely a bad pattern, but unfortunately we're doing an in-place substitution of the current videoscaler process with shellbox to help with migraiton to k8s
[13:51:14] <hnowlan>	 getting the capture now, the process takes ~40m 
[13:55:54] <bblack>	 TCP KA might save you though, if it's possible to turn that on in the client.  It will at least keep some middle-boxes aware of the connection and not forgetting it during the long idle.
[14:16:48] <topranks>	 bblack: excuse my ignorance here, maglev is stateless right?  it doesn't need to keep track of what connections are present?  But IPVS itself does enforce some kind of TCP state-machine sanity check?  i.e. it needs to see a SYN first, and ACK arriving out of the blue (for whatever reason, here because a previous connection has been forgotten due to inactivity) gets dropped?
[14:24:59] <hnowlan>	 topranks, XioNoX: pcap is at mwmaint1002.eqiad.wmnet:/home/hnowlan/shellbox.pcap (it's quite big as most of the flow is uploading a 200MB video file) 
[14:32:51] <topranks>	 hnowlan: that looks very like the LVS is resetting the connection 
[14:33:02] <topranks>	 There is loads of traffic for first 2 seconds 
[14:33:23] <topranks>	 then nothing for the next 2050 seconds (so like 30 mins or something?)
[14:33:57] <topranks>	 at that point the shellbox server starts sending traffic back to the client (packet 34538)
[14:34:34] <topranks>	 mwmaint1002 ACKs those new packets in 34540 
[14:35:11] <topranks>	 and immediately a RST is received from the shellbox-video IP (but I suspect it's the LVS sending) 
[14:35:18] <hnowlan>	 yeah, seeing some traffic returned in the first place is a surprise. in debug mode you can see envoy hitting a buffer limit when sending traffic but it never emtpies out when going via lvs 
[14:35:28] <hnowlan>	 so I was under the impression absolutely nothing went out
[14:35:54] <topranks>	 we could probably 100% confirm that by taking a pcap on the shellbox side as well (I suspect you won't see 34540 ACK arrive to it, and we can confirm it's not sending the RST in 34541)
[14:36:23] <topranks>	 TCP layer may not get enough data to send up the stack so yeah, app might not receive anything 
[14:37:05] <topranks>	 I think my question to Brandon is relevant here... and his suggestion of TCP keepalives makes sense (though I don't fully understand the LVS behaviour)
[14:37:15] <topranks>	 I see these configured, but need to understand them better I think
[14:37:22] <topranks>	 https://www.irccloud.com/pastebin/GfVjcbnn/
[14:37:23] <cdanis>	 if it is easy to enable tcp keepalives on the client side, i would certainly try it
[14:37:31] <cdanis>	 or i guess on either side
[14:37:33] <hnowlan>	 Unfortunately it doesn't seem like it'd be easy to get tcp keepalives added to the client - not out of the question, but not quickly 
[14:38:05] <topranks>	 Linux itself can do them I'm not sure we'd need anything else 
[14:38:43] <topranks>	 the problem is according to the current setting (and this ancient doc: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html), we should be sending keepalives after 5 mins of inactivity 
[14:38:47] <topranks>	 pcap does not show that happening 
[14:38:52] <cdanis>	 hnowlan: libkeepalive0 ?
[14:39:48] <hnowlan>	 given that the client is just a bunch of window dressing around curl we *should* be sending keepalives by default :/ 
[14:41:10] <cdanis>	 hnowlan: I thought curl's default was 0
[14:41:22] <cdanis>	 excuse me
[14:41:30] <cdanis>	 I thought curl(1)'s default was true, but, libcurl's default was 0
[14:41:47] <topranks>	 the socket options the sysctl's I pasted are configurable per-socket by an application, so perhaps the client is over-riding the system defaults 
[14:42:16] <topranks>	 cmooney@mwmaint1002:~$ curl --help | grep keepalive
[14:42:16] <topranks>	      --keepalive-time <seconds> Interval time for keepalive probes
[14:42:16] <topranks>	      --no-keepalive  Disable TCP keepalive on the connection
[14:42:36] <hnowlan>	 cdanis: oh good point. it'd work for testing but ultimately this is going to be a call from mediawiki so might be trickier long term 
[14:42:54] <cdanis>	 hnowlan: https://curl.se/libcurl/c/CURLOPT_TCP_KEEPALIVE.html there's a curlopt
[14:43:21] <hnowlan>	 ahh
[14:43:36] <cdanis>	 uh
[14:44:22] <cdanis>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/base/manifests/sysctl.pp#61
[14:44:24] <cdanis>	 ???
[14:45:16] <bblack>	 sorry, been looking elsewhere for a bit
[14:45:26] <bblack>	 maglev is stateless-ish at its own level, but LVS is not
[14:45:31] <topranks>	 cdanis: those are only the defaults if whatever opens the socket doesn't set them itself, my working theory here is curl or whatever else is the client is not using those but creating the socket with keepalives disabled 
[14:45:41] <bblack>	 and our LVS only sees one direction of the TCP traffic (client->server, not server->client direction response).
[14:45:52] <cdanis>	 topranks: yeah it's possible for sure
[14:46:09] <bblack>	 so there is a TCP connection state table involved, and after some minutes of timeout with no client->server packets, it will forget it and then fail when trying to route the ACK of the eventual server->client response data.
[14:46:29] <cdanis>	 I just hadn't realized we already enabled tcp keepalives in production
[14:46:37] <topranks>	 bblack: ok thanks 
[14:46:49] <topranks>	 is that connection state table custom for LVS or just the normal conntrack table?
[14:47:08] <bblack>	 it's an LVS conntrack table IIRC, not the one used by iptables or whatever.
[14:47:53] <hnowlan>	 found a way to pass curlopts to the client, will try that
[14:47:54] <topranks>	 that does somewhat make sense, only seeing one side of the traffic changes the situation vs. what regular conntrack is designed for 
[14:47:58] <bblack>	 in theory, LVS could/should be configurable to do this statelessly as well, esp with maglev.  But such an option doesn't exist in the kernel code.  We looked at patching it in once long ago, but it looked scary :)
[14:48:33] <bblack>	 (it can do stateless for UDP, just not TCP. For UPD it's called -o for "one-packet scheduling")
[14:48:37] <bblack>	 *UDP
[14:48:57] <topranks>	 bblack: I assume that's what we're looking at in these graphs then?
[14:48:58] <topranks>	 https://grafana.wikimedia.org/goto/sYJ9L6mNR?orgId=1
[14:49:20] <topranks>	 ok that is good to know 
[14:49:20] <bblack>	 topranks: some derivative based on it, anyways
[14:49:51] <topranks>	 ok
[14:51:27] <bblack>	 (don't cat this file on big live lvs servers, it can suck a lot of resources spewing a lot of data, but:)
[14:51:30] <bblack>	 root@lvs2011:~# head -20 /proc/net/ip_vs_conn
[14:51:31] <cdanis>	 hnowlan: the keepalives should be pretty obvious in the tcpdump
[14:51:32] <bblack>	 Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expires PEName PEData
[14:51:35] <bblack>	 TCP 94B221E8 A19D D05099E0 01BB 0AC010B6 01BB ESTABLISHED     898
[14:51:38] <bblack>	 TCP 2601:0281:d301:e350:800e:1f9a:7bce:3cb2 FCCA 2620:0000:0860:ed1a:0000:0000:0000:0001 01BB 2620:0000:0860:0104:0010:0192:0048:0154 01BB ESTABLISHED     878
[14:51:41] <bblack>	 TCP 94B24346 8F5A D05099E0 01BB 0AC01020 01BB FIN_WAIT         63
[14:51:43] <bblack>	 [...]
[14:51:46] <bblack>	 ^ that's the LVS TCP state table
[14:51:54] <bblack>	 (aka IPVS)
[14:52:03] <topranks>	 nice... good to know 
[14:52:32] <bblack>	 and the same basic info is in more-consumable form from "ipvsadm" listing conns (which can also be quite expensive and long-output)
[14:53:13] <topranks>	 yeah I'll avoid actually doing it but good to know it's there 
[14:53:30] <bblack>	 do we really have keepalives somehow turned on for all traffic in the kernel? I thought you could only enable their use, but not force it
[14:54:16] <topranks>	 yeah it's in the base puppet profile, c.danis linked to it above 
[14:54:35] <cdanis>	 bblack: ah you're right
[14:54:46] <cdanis>	 applications still need to setsockopt SO_KEEPALIVE
[14:55:02] <topranks>	 ah ok 
[14:55:06] <bblack>	 yeah I would guess so
[14:55:16] <bblack>	 although you can also change the defaults (it defaults to 2h, which isn't very useful here)
[14:55:16] <topranks>	 so we are setting the default timers to use if an application does not with the sysctls
[14:56:11] <bblack>	 I'm guessing from that quick peek at the state table, that we timeout on LVS tcp state in this case after 900s (15 minutes)
[14:56:38] <topranks>	 bblack: from looking at the graphs (which seem to be the count of the connection table you listed) yeah they expire after 15 mins 
[14:56:55] <cdanis>	 'net.ipv4.tcp_keepalive_time'      => 300, seems good then
[14:57:01] <topranks>	 I mean from looking at the graphs if we say disable pybal on one, it drops off for about 15 mins to 0 
[14:57:04] <topranks>	 300 seems reasonable 
[14:57:44] <topranks>	 from reading online it seems 'tcp_keepalive_intvl' being set to 1 will have no effect, repeat keepalives will be sent every 300 seconds based on 'tcp_keepalive_time'
[14:58:08] <topranks>	 'tcp_keepalive_intvl' only works if it's greater than 'tcp_keepalive_time', well according to one random person anyway :) 
[14:58:13] <topranks>	 https://unix.stackexchange.com/questions/350802/tcp-keep-alive-parameters-not-being-honoured
[14:58:41] <topranks>	 a keepalive every 1 second starting after 5 mins would not be good, which is why I mention 
[14:59:31] <bblack>	 yeah
[14:59:50] <bblack>	 but apparently that's not how it works in practice, is what they're saying there? it would just be 1/s until keepalive-ack?
[15:00:10] <vgutierrez>	 as an example... pybal/liberica healthchecks send a keepalive every 15s
[15:00:46] <vgutierrez>	 oh.. pybal does it every 30s
[15:02:19] <topranks>	 bblack: yeah that's my reading, it's more like how often to re-try sending the keepalive, if it wasn't ACKed
[15:02:41] <topranks>	 assuming all keepalives are ACK'd then they should get sent every 'tcp_keepalive_time' seconds
[15:23:12] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254615 (10cmooney)
[15:24:01] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254623 (10cmooney)
[15:26:33] <hnowlan>	 thanks for all the help! Still trying to get the client to play nice for now
[15:31:14] <cdanis>	 hnowlan: if you just want to fix-verify I do think that libkeepalive0 is a totally valid option
[15:31:49] <cdanis>	 (it's an SO_PRELOAD hack)
[15:31:58] <cdanis>	 uh LD_PRELOAD 
[15:32:15] <hnowlan>	 cdanis: absolutely, it's a good option. I've just sniped myself into making the library honour the parameters it claims it does in the docs 
[15:32:33] <cdanis>	 ahah fair enough
[15:33:27] <hnowlan>	 turns out it might mis-name a parameter in the docs and also weirdly supports using bare curl constants as parameters despite it being php
[15:33:31] <hnowlan>	 I'm seeing keepalives now at least
[15:41:51] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254748 (10cmooney)
[16:02:09] <hnowlan>	 enabling keepalives has fixed the issue - I owe you all a drink next summit. Thank you very much 
[16:03:06] <bblack>	 \o/
[16:03:14] <bblack>	 (for the drinks, not the packets)
[16:08:02] <topranks>	 awesome :) 
[17:00:43] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Manange fundraising network eleemnts from Netbox - https://phabricator.wikimedia.org/T377996 (10cmooney) 03NEW p:05Triage→03Medium
[17:01:19] <wikibugs>	 10netops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 06SRE: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10255334 (10cmooney) Just a note to say that fundraising no longer use any VM infra, so every assigned IP I believe belongs to just a single server.
[17:01:22] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Manange fundraising network eleemnts from Netbox - https://phabricator.wikimedia.org/T377996#10255335 (10cmooney)
[17:01:33] <wikibugs>	 10netops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 06SRE: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10255336 (10cmooney)
[17:01:49] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255340 (10cmooney)
[17:02:20] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255341 (10cmooney)
[17:03:13] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255344 (10cmooney)
[17:25:43] <sukhe>	 topranks: moving here
[17:25:47] <topranks>	 ok
[17:25:50] <sukhe>	 and looking at your pcap now
[17:26:53] <topranks>	 To sum up my thoughts on it everything looks ok, but the 29-second gap between packet 34 and 35 is quite high 
[17:27:17] <topranks>	 I assume that is how long it's taking to run whatever commands it runs on the remote box 
[17:27:37] <topranks>	 I do know the authdns-update does a bunch of stuff so maybe that's normal but it seems fairly high 
[17:27:45] <topranks>	 you said getting via git was slow?
[17:28:16] <sukhe>	 29 seems a bit high but not that much I would say, not that we have any real numbers so it's visceral
[17:28:30] <sukhe>	 to recap: the updates complete but there is a clush timeout
[17:28:34] <sukhe>	 and that's what is bothering me
[17:28:34] <sukhe>	 CLUSH_OPTIONS="-S -B -n -t 10 -u 45"
[17:28:50] <topranks>	 -t 10 suggests 10 second limit?  but it's taking 29?
[17:28:59] <topranks>	 (guessing what the switches do)
[17:29:04] <sukhe>	 10 seconds connect timeout, 45s execution
[17:29:10] <topranks>	 oh ok
[17:29:12] <topranks>	 should be ok then 
[17:29:18] <sukhe>	 so I guess we need to figure out why it complains of a timeout
[17:29:27] <sukhe>	 which means making clush more verbose if everything looks OK on the network side
[17:30:49] <topranks>	 yeah network seems fine, pcap trace is very clear 
[17:37:13] <sukhe>	 so 
[17:37:28] <sukhe>	 running with -d, it seems to be that 45s is not enough for the command execution
[17:38:27] <bblack>	 oh wait, did we recently greatly expand the zonefile list for the markmonitor imports maybe?
[17:38:44] <sukhe>	 good point
[17:38:47] <sukhe>	 but that has been for a while?
[17:39:05] <sukhe>	 I see nothing in the logs to indicate that
[17:39:18] <bblack>	 ok
[17:39:23] <bblack>	 yeah the list doesn't look crazy-long
[17:39:25] <sukhe>	 the most significant change I see so far is the k8s delegations
[17:39:39] <sukhe>	 I doubt they are signficant enough to cause an issue though
[17:39:58] <topranks>	 they're not exactly small but they've been in place a few weeks 
[17:40:05] <bblack>	 can try with a much longer -u manually?
[17:40:06] <topranks>	 https://usercontent.irccloud-cdn.com/file/VGqcOsK9/dns1004_to_dns2005.pcap
[17:40:19] <sukhe>	 topranks: it's also possible though that this was broken all this time and no one else noticed that it was
[17:40:20] <bblack>	 just to confirm
[17:40:43] <sukhe>	 trying -u 60
[17:40:44] <topranks>	 much longer -u sounds like a good test yep 
[17:40:49] <bblack>	 nobody noticed that authdns-update was reporting failure you mean?
[17:40:50] <topranks>	 try -u 300 
[17:40:55] <sukhe>	 bblack: yeah
[17:41:01] <bblack>	 well that's silly
[17:41:01] <sukhe>	 because well the timeout is above the line that says:
[17:41:05] <sukhe>	 OK - authdns updated successfully
[17:41:05] <topranks>	 I've had to run it manually a few times, for things like k8s update 
[17:41:14] <sukhe>	 and two lines before that it says
[17:41:14] <topranks>	 I'd like to hope I'd have noticed a failure!  Is it very subtle?
[17:41:14] <sukhe>	 clush: dns[1006,2004,2006,3003-3004,5003-5004].wikimedia.org: command timeout
[17:41:23] <bblack>	 sukhe: yeah it shouldn't be claiming success if it times out...
[17:41:36] <bblack>	 what's going on here?
[17:42:24] <sukhe>	 -u 60 worked
[17:42:25] <sukhe>	 cleanly
[17:42:31] <sukhe>	 No action needed, zones and config files unchanged
[17:42:31] <sukhe>	 OK
[17:42:31] <sukhe>	 ---------------
[17:42:31] <sukhe>	 dns[1005-1006,2004-2006,3003-3004,4003-4004,5003-5004,6001-6002,7001-7002].wikimedia.org (15)
[17:42:34] <sukhe>	 ---------------
[17:42:36] <sukhe>	 OK - authdns updated successfully
[17:42:39] <sukhe>	 OK - authdns-update successful on all nodes!
[17:42:46] <sukhe>	 I think we need to improve the messaging here a bit
[17:42:54] <sukhe>	 ---
[17:42:58] <sukhe>	 when it fails on some nodes, it says:
[17:42:58] <sukhe>	 OK - authdns updated successfully
[17:43:08] <sukhe>	 when it passes on all nodes it, it says:
[17:43:11] <sukhe>	 OK - authdns-update successful on all nodes!
[17:43:11] <bblack>	 yeah because the bash script is supposed to abort if clush returns non-zero, due to "set -e"
[17:43:37] <sukhe>	 bblack: what does a clush timeout mean here though? because it clearly still ran the command even though it timed out
[17:43:38] <bblack>	 oh I see
[17:44:02] <bblack>	 "OK - authdns updated successfully" is the output of authdns-local-update
[17:44:17] <bblack>	 which it's reporting in the clush output, I guess
[17:44:34] <bblack>	 so we need to stop relying on "set -e" down there, and print an explicit final failure message?
[17:45:33] <bblack>	 and yeah, maybe we have to increase timeout as well, I'm sure it was fairly abitrary and old
[17:45:45] <sukhe>	 I am still not sure what changed though and that part bothers me
[17:46:09] <sukhe>	 topranks: thanks for your help anyway and sorry for the noise. my first impulse should have been to just check clush but what didn't help was that it felt that pulling in updates from dns.git was also taking longer
[17:46:28] <topranks>	 would that error show when running the sre.dns.netbox cookbook?
[17:46:37] <bblack>	 sukhe: I assume some combination of various sources of ever-expanding zone data
[17:46:38] <topranks>	 I found one from a few weeks back in my terminal scrollback
[17:46:38] <topranks>	 https://phabricator.wikimedia.org/P70572
[17:46:40] <sukhe>	 it should 
[17:46:47] <topranks>	 don't see it I think 
[17:47:31] <sukhe>	 ah wait, no, looking at netbox.py
[17:48:37] <topranks>	 sukhe: when you set a longer timeout manually did it work?
[17:48:37] <bblack>	 I'm experimenting on dns1006 with the timing, to see what's at the bottom of this
[17:48:47] <sukhe>	 topranks: yeah, -u60 worked just fine
[17:48:53] <topranks>	 ok right 
[17:49:12] <bblack>	 root@dns1006:~# time authdns-local-update dns2004.wikimedia.org
[17:49:12] <bblack>	 OK - authdns updated successfully
[17:49:13] <bblack>	 real	0m45.282s
[17:49:13] <sukhe>	 topranks: and the netbox.py thing is that because it doesn't call authdns-directly, it calls deploy-check which reloads gdnsd
[17:49:25] <bblack>	 ^ that seems wrong, how long it's taking there...
[17:49:27] <sukhe>	 so you won't see the same output
[17:50:12] <topranks>	 ok
[17:50:49] <bblack>	 I'm gonna dig on the authdns-local-update $FQDN timing a bit
[17:50:56] <bblack>	 (testing on dns1006 with different parts)
[17:52:38] <bblack>	 pretty much all of the ~45s delay happens in this step:
[17:52:42] <bblack>	 Assembling and testing data in /tmp/dns-check.ap2yxdzf -- Generating zonefiles from zone templates
[17:52:45] <bblack>	 -- Processed 594 zones into directory /tmp/dns-check.ap2yxdzf/zones
[17:52:57] <bblack>	 (between the generating... and processed lines)
[17:53:10] <bblack>	 so yeah, it's something about our zone data that's making it take forever lately
[17:53:19] <bblack>	 probably the deploy script is doing something horribly inefficient
[17:53:28] <bblack>	 that'snot even a network issue, just local data processing
[17:54:05] <bblack>	 for now, up the timeout in authdns-update to at least 60, maybe like 90 just to be sure
[17:55:06] <bblack>	 the time is all in utils/gen-zones.py
[17:55:22] <sukhe>	 I am guessing then this has been broken for a while
[17:55:24] <bblack>	 which apparently I have the vast majority of the blame lines for, go figure :P
[17:55:29] <bblack>	 well not broken, just very slow
[17:55:41] <bblack>	 not scaling well with increases in zone data (file counts and/or data size)
[17:55:51] <sukhe>	 yes broken is not the right word here because it still pushes out the changes so 
[17:56:09] <bblack>	 but for immediate action items:
[17:56:14] <bblack>	 1) Increase timeout
[17:56:37] <bblack>	 2) Fix up the bottom of authdns-update to explicitly output a FAILED kind of message if clush returns non-zero
[17:56:59] <bblack>	 right now the output as-seen is confusing if it succeeds on some and times out on others
[17:57:30] <sukhe>	 yeah on it. I will try to see how to imrpove the final output message as well
[18:02:37] <sukhe>	 I am getting just the timeout change out for now to get a basic thing working
[18:02:43] <sukhe>	 then we can figure out the rest of the stuff
[18:52:45] <bblack>	 just kind of scoping out the gen-zones problem a bit
[18:53:05] <bblack>	 it is a lot of data these days.  Once all the symlinks are unwrapped and all the netbox stuff is included, etc...
[18:53:25] <bblack>	 ~21MB of zone data
[18:53:50] <bblack>	 1013 total zonefiles/includefiles of some kind
[18:54:06] <bblack>	 and there's jinja templating on all of it, etc
[18:54:19] <bblack>	 there's probably lots of ways we could make "gen-zones" more efficient
[18:54:56] <bblack>	 I'm not even quite sure why exactly we unwrap all the symlinks
[18:57:02] <bblack>	 537 of those files are all of one of 3 main types (identical contents by md5sum, just for different zone names)
[18:57:17] <bblack>	 mostly parking
[18:57:37] <bblack>	 but the parking outputs are ~36KB files or whatever.  Because they include the whole list of language hostnames and mobile subs, etc
[18:58:05] <bblack>	 probably need to refactor the whole gen-zones bit and make less-lazy choices about efficiency
[18:58:41] <bblack>	 if even the jinja outputs are identical, we could be doing that templating just once, and then symlinking it like we do in the original repo
[19:00:21] <bblack>	 mostly the jinja templates were just to insert commitmsg/commitdate info into the output files for later human debugging
[19:00:43] <bblack>	 e.g. near the top of the 10.in-addr.arpa file we have a templated line:
[19:00:46] <bblack>	 ; 2024101808 7847d6b6 Remove obsolete api records
[19:01:00] <bblack>	 ^ which indicates the date, hash, and first line of commitmsg of whatever last affected that file
[19:01:10] <bblack>	 (even if it was affecting the underlying file for a symlink)
[19:01:46] <bblack>	 but the new k8s pod stuff uses it creatively too:
[19:02:00] <bblack>	 {% for z in range(64,72) -%}
[19:02:00] <bblack>	 $ORIGIN {{ z }}.64.@Z
[19:02:00] <bblack>	 {% for i in range(256) -%}
[19:02:00] <bblack>	 {{ i }} 1H IN PTR       kubernetes-pod-10-64-{{ z }}-{{ i }}.eqiad.wmnet.
[19:02:03] <bblack>	 {% endfor %}
[19:02:05] <bblack>	 {% endfor %}
[19:02:05] <bblack>	 (in a few places)
[19:03:26] <bblack>	 that creates ~4K lines in that file
[19:03:46] <bblack>	 so yeah, I donno, just thinking out loud.  will look at this some more later
[19:04:02] <bblack>	 should probably profile gen-zones and figure out what's really costing us
[19:04:40] <sukhe>	 can do that
[19:04:58] <sukhe>	 for now at least I fixed the above clush issues and we can look at this 
[19:05:20] <sukhe>	 we can probably dig up the exact number but even looking at 
[19:05:30] <sukhe>	 13:52:45 < bblack> -- Processed 594 zones into directory /tmp/dns-check.ap2yxdzf/zones
[19:05:36] <sukhe>	 that's certainly an increase somewhere
[19:06:15] <bblack>	 well a lot of that is just all the new parking templates
[19:09:59] <bblack>	 the generating...processing step that took ~45s in prod takes ~8s on my laptop on python3.12 :P
[19:10:19] <bblack>	 the tests are only set up to run on py3.7 and 3.9, had to trivially hack the tox config
[19:11:16] <bblack>	 unstable doesn't even ship a python that old
[19:13:14] <sukhe>	 didn't do a bisect because linear search ftw (!) but
[19:13:31] <sukhe>	 current zones 594
[19:13:36] <sukhe>	 prior to c411a0378ba08a7660b92eb93c422ea392c50ea1
[19:13:38] <sukhe>	 217
[19:13:39] <sukhe>	 so yeah
[19:13:42] <sukhe>	 that is where the bloat happened
[19:13:50] <bblack>	 well it's where the file count bloat happened
[19:13:53] <sukhe>	 yes
[19:14:01] <sukhe>	 well
[19:14:01] <bblack>	 it could still jus tbe the k8s pods templating that added a lot of time
[19:14:05] <sukhe>	  -- Processed 217 zones in dry-run mode
[19:16:29] <bblack>	 but yeah I think it is just the count.  I did some basic printf-debugging and there's no big pause when processing 10.in-addr.arpa
[19:16:39] <bblack>	 just takes a while to spew its way through the 594 zones
[19:17:21] <bblack>	 still, it's significantly faster on my laptop.  I assume newer-python optimizations of some kind.
[19:19:14] <bblack>	 py312-zones: OK (27.60=setup[1.00]+cmd[26.60] seconds)
[19:19:33] <bblack>	 ^ but ~20s of that is just checking out the ops/dns git repo remotely over the internet
[19:19:57] <bblack>	 then ~8s to rip through the zonefile templating junk
[19:21:04] <sukhe>	 I think like I said previously, either I am dreaming or the ops/dns checkout is slow today/has become slower
[19:21:13] <sukhe>	 so there might be some of that on the dns hosts compared to local
[19:21:28] <sukhe>	 it is fast for me locally but then again, it's not an exact comparison in that sense
[19:23:04] <bblack>	 it might be, but I was definitely observing, when doing: dns1006# authdns-local-update dns2004.codfw.wmnet
[19:23:25] <bblack>	 that the git checkout was negligible, but ~45s time passed during the gen-zones step
[19:25:17] <sukhe>	 ok. so for now the clush timeout is 90s and we error out if it failed
[19:25:33] <sukhe>	 I don't have any grand ideas on how to fix the templating issue so maybe we can revisit it if the need arises :)
[19:25:49] <sukhe>	 if you think have some ideas, I am all ears
[19:26:03] <sukhe>	 otherwise I think this is fine, most of the parking templates are in *I think*
[19:27:03] <bblack>	 I locally hacked my printf stuff onto dns1006 temporarily and confirmed, it's just the raw volume of it and some inefficiency that doesn't happen on my lapto, it's not the ks pods
[19:27:25] <bblack>	 10.in-addr.arpa wasn't particularly slow, but in general it was slow making its way through the 594 zone list
[19:27:42] <bblack>	 every zone is just a bit slow, enough that it takes ~45s to do ~600 of them.
[19:28:32] <bblack>	 maybe this version of python+modules we have on prod dns, is just terrible at file i/o or ninja templating in a way that my laptop's versions aren't
[19:28:40] <bblack>	 *jinja
[19:28:45] <sukhe>	 it doesn't happen on mine either, 3.12. we are on 3.11 on the DNS boxes, and apparently it seems there are performance improvements in 3.12. that might explain it, or faster local SSDs or something
[19:28:55] <sukhe>	 doesn't happen on mine -> it's pretty quick
[19:28:59] <bblack>	 are we on 3.11?
[19:29:14] <sukhe>	 yeah
[19:29:21] <bblack>	 well, the tox.ini for ops/dns.git only specifies python3.7 and 3.9
[19:29:24] <sukhe>	 python3 --version
[19:29:24] <sukhe>	 Python 3.11.2
[19:29:27] <bblack>	 that's why I was confused about how far back prod is
[19:29:37] <bblack>	 good that we're not CI-checking on versions we actually use? :P
[19:29:40] <sukhe>	 ah tox.ini
[19:29:51] <sukhe>	 yeah that can certainly be fixed
[19:29:53] <sukhe>	 and should be
[19:29:54] <sukhe>	 patching it
[19:30:04] <sukhe>	 oh no wait
[19:30:54] <bblack>	 FWIW 3.12's flake8 found a minor issue to nitpick in deploy-check.py
[19:31:08] <bblack>	 line ~536 or something:
[19:31:09] <bblack>	 -        if type(other) != DNSRecord:
[19:31:09] <bblack>	 +        if type(other) is DNSRecord:
[19:31:11] <sukhe>	 so deploy-check calls subprocess.run(['utils/gen-zones.py', str(tdir_zones)], check=True)
[19:31:23] <sukhe>	 yeah that's the thing that taavi submitted a patch for yesterday I think
[19:32:01] <bblack>	 where are you going with the subprocess.run() thing?
[19:32:47] <sukhe>	 figuring out at which stage of the process it was called
[19:32:53] <bblack>	 ah
[19:36:25] <bblack>	 by ssd disk i/o standards, ~1K files and ~21MB is nothing, so I assume it's not SSD or filesystem speeds
[19:36:29] <bblack>	 probably something in py3.12 and/or modules updates
[20:43:11] <wikibugs>	 06Traffic: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10256327 (10BCornwall) 05Open→03In progress
[20:48:48] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256380 (10cmooney)
[20:53:24] <wikibugs>	 06Traffic, 06Data Products, 06Data-Engineering: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute - https://phabricator.wikimedia.org/T375256#10256402 (10Ottomata)
[20:54:19] <wikibugs>	 06Traffic, 06Data Products: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute - https://phabricator.wikimedia.org/T375256#10256404 (10Ottomata) cc @Milimetric should we follow upon this?
[20:58:22] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256417 (10cmooney)
[21:02:29] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256437 (10Jclark-ctr) @cmooney  all cables have been connected for Step 2: Initial cabling for the new de...
[21:05:17] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256451 (10cmooney) >>! In T377381#10256437, @Jclark-ctr wrote: > @cmooney  all cables have been connected...