[04:51:39] Wikimedia-Apache-configuration, DBA, Patch-For-Review, Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850 (Marostegui)
[04:52:49] Wikimedia-Apache-configuration, DBA, Patch-For-Review, Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850 (Marostegui) Open>Resolved a:chasemp If the only hosts pending were db1095 and db1102 this can be...
[07:02:55] Wikimedia-Apache-configuration, DBA, Patch-For-Review, Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850 (jcrespo) As I said, I fixed this at https://phabricator.wikimedia.org/T187850#4078495
[08:16:34] something interesting that I found today while reviewing vk metrics
[08:16:36] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&from=now-7d&to=now&var-instance=eventlogging&var-host=All
[08:16:58] there seems to be a sporadic issue only for the varnishkafka-eventlogging instance in eqsin
[08:17:41] those delivery errors correspond to connect timeouts with high latencies (>=~500ms) reported in the vk logs
[08:17:58] only for this specific instance, only for eqsin hosts
[08:19:27] if I read https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png from eqsin we go through either ulsfo->eqord/codfw->eqiad or codfw->eqiad right?
[08:20:08] given the latencies of the link reported (awesome work XioNoX) 500ms seems not difficult to reach
[08:20:43] yeah, is there any specific vk timeout that we can work with?
[08:34:43] we can check the last peak registered
[08:34:55] the main question mark is why it happens only for one instance
[08:35:12] there might be some data/events specific issue that comes up with higher latencies?
[08:35:41] maybe some IP is messing up somehow with the app and generating big/fat events?
[08:35:43] (like how big a json event is, etc..)
[08:36:10] yeah exactly, it might be an explanation
[08:36:30] can you graph the size of the events?
[08:37:20] there is https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=31&fullscreen&orgId=1&var-instance=eventlogging&var-host=All&from=1530787316428&to=1530788233113
[08:37:40] but I don't recall exactly what it means, if it is the overall sum of the buffer queue or something else
[08:37:43] checking
[08:40:22] https://github.com/edenhill/librdkafka/wiki/Statistics
[08:40:34] "Current total size of messages in instance queues"
[08:41:18] so librdkafka does not send events immediately, but it buffers them in a queue.. if there is delay in sending (batches) of events, everything slows down
[08:41:37] so the bigger size might mean only more messages in there, not necessarily bigger ones
[08:41:45] (at least this is my impression)
[08:58:37] elukey: note that we have already seen in the past eqsin<->eqiad rtt being consistently higher for a certain host (cp1062 in that case) https://phabricator.wikimedia.org/T157430#3989077
[09:01:52] interesting!
[11:18:53] those in the network side of things may know more of this than me
[11:19:37] I read from some rfcs that hostnames should not be longer than 63 characters, and fqdn no longer than 255 ones, does that sound correct?
[11:22:30] "If you sit down and do the math, you'll see that the readable maximum length of an ASCII DNS name is 253 characters"
[11:22:45] from: https://blogs.msdn.microsoft.com/oldnewthing/20120412-00/?p=7873/
[11:23:07] I was just reading that
[11:23:24] also take into account that every label is restricted to 63 chars
[11:23:30] not only the first one (hostname)
[11:23:38] I don't really need lower bounds, I am just building a database and creating the types
[11:24:22] to be super safe, I created hostnames with varchar(100) ascii and fqdn as varchar(300) ascii
[11:24:30] I think that should be enough
[11:25:32] also you never know how people are abusing those in ways that violate rfcs
[13:15:23] there's a lot of tricky distinction between "hostname" and "domainname label" and such, the true limits are quite hard
[13:15:48] (you can have names that are illegal "hostnames" for a host, but are legal in the DNS protocol as "hostnames", and vice-versa)
[13:17:21] but something that will be resolvable via the DNS, there are hard rules: labels (things between the dots in ascii) can't go over 63, and the total length in ASCII effectively can't be longer than 254 bytes
[13:18:08] (you only reach 254 if you include a trailing dot to deterministically terminate the name, as opposed to letting some client library try to append a domainname. e.g. "wikipedia.org.". It's 253 in ascii if you don't include that optional trailing dot)
[13:19:37] oh, but that also assumes none of the ascii letters needed escaping, which would then expand it more depending on the escaping mechanism (the official DNS one up to 4 bytes per char I think)
[13:19:45] I saved a bunch of notes to self on this topic in a source code comment in:
[13:19:55] https://github.com/gdnsd/gdnsd/blob/master/include/gdnsd/dname.h#L33
[14:30:15] as a follow up from this morning's discussion about vk, I found something interesting missing for statsv/eventlogging - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444232/
[14:30:38] this in theory should help in reducing the amount of bytes sent from those vk instances
[14:49:32] bblack: so, we have this ready: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444005/ ema wisely suggested to merge it on Monday, just to keep deploy gods happy :)
[14:59:21] bblack: adding a wishlist item to the queue - the kafka TLS review work seems finished, would it be ok for you to start thinking about removing ipsec between cp* and kafka jumbo nodes?
[15:30:40] vgutierrez: yeah, monday sounds good :)
[15:31:19] elukey: yeah I think we're there, next week should do some final verification and pull the plug probably
[15:31:23] (on ipsec I mean)
[15:32:12] <3
[15:35:59] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (Jc86035)
[15:37:58] bblack: \o/
[16:53:08] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) >>! In T198623#4397016, @elukey wrote: > I am pretty sure that this is a pre-scap thing, we should drop it :) Great! > Other thi...
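Editorial sketch for the DNS length rules discussed at 13:17-13:19 above: a minimal check, assuming a plain ASCII name with no escaped characters (escaping would change the arithmetic, as noted in the discussion). The function name `check_dns_name` and the constants are hypothetical and used only for illustration; they are not taken from gdnsd or any other library.

```python
MAX_LABEL_LEN = 63   # per-label limit (the parts between the dots)
MAX_NAME_LEN = 253   # total ASCII length, not counting the optional trailing dot


def check_dns_name(name):
    """Return a list of length-rule violations; an empty list means the name fits.

    Assumption: the name is plain ASCII and contains no escaped characters.
    """
    # A single trailing dot only marks the name as fully qualified,
    # so it is ignored when measuring against the 253-character limit.
    bare = name[:-1] if name.endswith(".") else name
    if not bare:
        return ["empty name"]

    problems = []
    if not bare.isascii():
        problems.append("name is not plain ASCII")
    if len(bare) > MAX_NAME_LEN:
        problems.append(
            "name is %d characters, limit is %d" % (len(bare), MAX_NAME_LEN)
        )
    for label in bare.split("."):
        if not label:
            problems.append("empty label")
        elif len(label) > MAX_LABEL_LEN:
            problems.append(
                "label is %d characters, limit is %d" % (len(label), MAX_LABEL_LEN)
            )
    return problems


if __name__ == "__main__":
    print(check_dns_name("wikipedia.org."))
    print(check_dns_name("a" * 64 + ".org"))
```

With these limits, the varchar(100) hostname and varchar(300) fqdn columns mentioned at 11:24 leave headroom over the 63 and 253 character bounds for unescaped names.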
[16:59:46] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) @ayounsi are we sure that we can touch common-infrastructure4 without affecting anything else? Is there any trace of who made it...
[17:20:21] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) >>! In T198623#4403989, @elukey wrote: > @ayounsi are we sure that we can touch common-infrastructure4 without affecting anythi...
[18:49:19] Traffic, netops, Operations, ops-codfw: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (ayounsi) NIC2/3/4 configured as of the current task description.