[07:34:12] netops, Operations, monitoring: "MySQL server has gone away" from librenms logs - https://phabricator.wikimedia.org/T171714#3492098 (fgiunchedi) Open>Resolved a:fgiunchedi The move should have been completed by the 26th but indeed the errors are gone, tentatively resolving
[09:31:32] Traffic, Operations, ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3492337 (ema) @Cmjohnson any news?
[10:04:25] cp4018 (text ulsfo) is running with nginx-lua metrics for prometheus enabled as of now
[10:04:28] let's see how it goes!
[10:11:47] <_joe_> \o/
[10:29:19] is also prometheus already polling it btw?
[10:34:57] godog: nope, we've only added the config for cache_misc
[10:50:55] Traffic, netops, Operations: Poor conectivity (Vodafone/THUS in UK) - https://phabricator.wikimedia.org/T172262#3492515 (Marostegui)
[11:57:24] netops, Operations, fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3492728 (mark) No objections from me. It does add complexity somewhat and will probably add some failure modes where both ports fail to come up (LACP failin...
[12:04:04] Traffic, netops, Operations: Poor conectivity (Vodafone/THUS in UK) - https://phabricator.wikimedia.org/T172262#3492487 (mark) The paste (traceroute) you provided already shows exceptionally high (1s+) latency to your first hop (either your home router or the first ISP router), so this doesn't look l...
[12:25:30] ema: any chance you could review those outstanding pybal unit tests sometime soon?
[12:25:40] perhaps I'll work some more on it on the plane next week
[12:38:39] mark: nice, for sure
[12:38:51] thanks
[12:38:53] obviously not urgent
[14:04:07] so last night I was running back over, in my head, how to make an NSS module that's flexible enough for whatever needs, and yet doesn't have a grossly-complicated config and runtime structure to it (about nested and/or parallel pools of servers with separate timeouts, etc)
[14:05:18] I think I see now, that the easy way is to structure the configuration of nameservers as a serial list with a wait timeout after each, and not care about duplicates, etc
[14:05:40] so the idea is the NSS plugin creates a single socket to send the request(s) on and assigns a dns query randomized ID to use
[14:06:34] bblack: you mean a list like a, a, a, b, b, c, c, c... ?
[14:06:39] and the config looks like, e.g.: 10.0.0.1,0.1; 10.0.0.2,0.3; 10.0.0.3,0; 10.0.0.4,0.5; ....
[14:07:00] each is a nameserver destination IP (well +port in the real world) and a timeout value after it
[14:07:16] and what it's explicitly controlling is the pacing of sending outbound requests while waiting for any one of them to finally respond
[14:07:40] so the above means send to .1, wait 100ms, send to .2, wait 300ms, send immediately to .3 + .4, wait 500ms
[14:07:50] by putting zero-timeouts after some, you can make parallel sets, basically
[14:08:18] even if you made a more-complex structuring, it all boils down to a linear behavior like that anyways, and it's simpler
[14:08:54] and then one overall timeout value (if no answer within X seconds, fail all the way to the caller), and the module loops through the list of "IP to send to + timeout before next send" until it hits the overall timeout
[14:09:07] the user is in control of timing and exponential backoff, etc
[14:09:58] so if you want it to try the first server 3x with backoff timeouts before moving to a parallel pair next, it's like
[14:10:24] and the first that answers wins I guess
[14:10:41] 10.0.0.1,0.2; 10.0.0.1,0.5; 10.0.0.1,1; 10.0.0.52,0; 10.0.0.53,1
[14:11:09] (send 3x to first server waiting 200ms, 500ms, 1000ms, then try the next two servers in parallel waiting 1000ms, then start over (until global timeout reached))
[14:11:22] and yeah, in all cases if any answer comes in response to any past query, we accept it and stop processing
[14:11:43] (since they're all the same source port + dns ID, any answer will do, we're waiting on all possible answers in parallel on the same socket)
[14:11:47] windows 8.1 introduced some funky stuff in that regard https://technet.microsoft.com/en-us/library/dn305896(v=ws.11).aspx
[14:12:07] DNS query adaptive timeout is even a cool name! :)
[14:12:40] this won't be very adaptive, more like "explicit operator control because they know their network infra and their timeout demands"
[14:13:25] structuring it like the above makes it relatively simple to implement, which is key, because C shlib code hiding under glibc is dangerous
[14:13:57] just need the list of [server,timeout], a global timeout option, and then the standard little options like ndots,search,domain,etc
[16:50:10] Traffic, Analytics, Operations, Patch-For-Review, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493917 (elukey) After a chat with @Ottomata we realized that applying the correct namespace (`'kafka.'` prefix to all the li...
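Editor's sketch of the [server,timeout] list discussed between 14:05 and 14:13. It only computes the pacing of outbound queries (when each server would be sent to, looping until the overall timeout) and does no socket or DNS work. The function names `parse_schedule` and `send_plan`, the millisecond units, and the 1-second overall timeout in the usage note are assumptions made for illustration; nothing here comes from an actual NSS module, which per the discussion would be written in C.

```python
from typing import List, Tuple


def parse_schedule(spec: str) -> List[Tuple[str, int]]:
    """Parse 'ip,seconds; ip,seconds; ...' into (server, wait_ms) pairs.

    Hypothetical helper: the wire format is the one quoted in the chat,
    the conversion to integer milliseconds is an editorial choice.
    """
    entries = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        server, wait = item.rsplit(",", 1)
        entries.append((server.strip(), round(float(wait) * 1000)))
    return entries


def send_plan(entries: List[Tuple[str, int]], overall_ms: int) -> List[Tuple[int, str]]:
    """Return (send_time_ms, server) pairs, looping over the list until
    the overall timeout is reached. A zero wait means the next entry is
    sent at the same instant, which gives a parallel set."""
    if not entries or sum(wait for _, wait in entries) <= 0:
        raise ValueError("schedule needs at least one non-zero wait")
    plan = []
    now = 0
    while True:
        for server, wait in entries:
            if now >= overall_ms:
                return plan
            plan.append((now, server))
            now += wait
```

With the first config from the chat and an assumed overall timeout of 1000 ms, `send_plan(parse_schedule("10.0.0.1,0.1; 10.0.0.2,0.3; 10.0.0.3,0; 10.0.0.4,0.5"), 1000)` sends to .1 at 0 ms, .2 at 100 ms, .3 and .4 together at 400 ms, then wraps around to .1 at 900 ms. A real module would stop as soon as any answer arrives on the shared socket.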
[16:50:28] Traffic, Analytics-Kanban, Operations, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493918 (elukey)
[17:24:57] netops, Operations, Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3494084 (ayounsi) Open>Resolved
[18:14:42] ema, bblack that cp1099 is still alerting
[19:59:40] Traffic, DBA, MediaWiki-extensions-WikibaseClient, Operations, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3494814 (Krinkle)
[19:59:51] Traffic, DBA, MediaWiki-extensions-WikibaseClient, Operations, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3224448 (Krinkle)
[20:22:39] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3494940 (DStrine)