[06:27:28] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) >>! In T254332#6523115, @mforns wrote: > This, I assume, needs to be added to the pmacct producer? That's correct,...
[09:09:08] Traffic, Operations, Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (ema) It seems that this task is starting to follow the communication pattern that emerged in T238494, which wasn't great. Flagging it early on to try to avoid the unpleasan...
[09:22:05] Traffic, Analytics-Clusters, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (Vgutierrez) yeah.. I'll handle the backport :)
[09:25:11] volans: <3 --^
[09:25:14] err vgutierrez
[09:25:19] also love to volans of course
[09:25:23] ahahah
[09:25:26] rotfl
[09:25:27] I'm not jealous
[09:25:36] I know I know :)
[09:33:00] Traffic, Analytics-Clusters, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (Vgutierrez) @elukey double checking https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/varnish4/+/refs/heads/debian-wmf/lib/libvarnishapi/vut.c#424 it looks like ed1...
[09:39:00] Traffic, Operations, Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) It seems normal to me to paste screenshots of findings, create tasks when a significant regression is witnessed, verify the cause via a targeted rollback. Performan...
[09:49:07] Traffic, Analytics-Clusters, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) ah snap I didn't check, will do the next time using the gerrit repo. It is not clear to me why we have all those fstat calls though..
[10:00:49] vgutierrez: it is so weird, the bug's description matches well with what I am seeing
[10:09:58] yup
[10:15:39] Traffic, Operations, Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) On the topic of reliable metrics, our RUM monitoring is reliable, battle-tested and the ultimate truth about the performance users really experience. While it can't...
[12:08:18] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) Thanks @ayounsi > `"comms": "14907:0_14907:2_14907:3"` > To mean that the flow has the 3 communities 14907:0 14907...
[12:44:10] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) > Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can...
[12:51:13] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) OK, after a very interesting chat with Joseph, here's our conclusions: * It would be cool to have the core of the r...
[13:19:07] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) @ayounsi > Ok to merge anytime or should I sync up with you? I believe it's OK to merge, and that Refine should id...
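Note on the `comms` format quoted at 12:08:18: the string joins several BGP communities with underscores, and the discussion mentions turning it into a Druid multi-value dimension. A minimal Python sketch of that split, assuming the underscore is the only separator. The function name `split_communities` is hypothetical and is not the Refine transform function discussed in the task.

```python
from typing import List


def split_communities(comms: str, sep: str = "_") -> List[str]:
    """Split a joined community string into a list of individual values.

    Hypothetical helper: assumes communities are joined by `sep` and that
    an empty input means "no communities".
    """
    return [c for c in comms.split(sep) if c]


# Sample value taken from the log; the surrounding record shape is assumed.
record = {"comms": "14907:0_14907:2_14907:3"}
record["comms"] = split_communities(record["comms"])
print(record["comms"])
```

A list-valued field is the shape a multi-value dimension would be ingested from; whether the split happens in Refine or in a later job is left open in the log.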
[13:36:53] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis)
[13:37:00] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) p:Triage→High
[14:03:17] netops, Operations, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond)
[14:08:56] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond)
[14:33:42] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (CDanis) I don't think the reflection is a concern; everything from NTP to memcache to OpenVPN to CLDAP is a better reflector by orders of magnitude. Apply...
[14:36:31] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) >>! In T264888#6525436, @CDanis wrote: > I don't think the reflection is a concern; everything from NTP to memcache to OpenVPN to CLDAP is a better...
[14:38:59] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (CDanis) +1 from me then :)
[14:45:29] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (ayounsi) As discussed over IRC, that LGTM. Would it be easy to rollback if there are any issues or the result is not as fast as expected?
[14:50:54] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) > Would it be easy to rollback if there are any issues or the result is not as fast as expected? Yes and although the current CR doesn't include th...
[14:52:37] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (MoritzMuehlenhoff) Agreed, the service enumeration/information disclosure angle is moot for us, so let's give this a shot. If we make it configurable via Hie...
[14:58:34] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (Nuria) >Maybe leaning towards using a transform function, because code would be shorter and less moving pieces? I think havi...
[15:08:52] netops, Cognate, Growth-Team, Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (Cmjohnson)
[15:13:34] netops, Cognate, Growth-Team, Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[15:18:24] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) Hi @CDanis! This must be related to our new widgets, which request...
[15:23:16] netops, Operations, ops-eqiad, User-Kormat, User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Cmjohnson)
[15:24:07] netops, Operations, ops-eqiad, User-Kormat, User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Cmjohnson) Open→Resolved ran script for the old asw2-d4 and changed name to old-asw2-d4. Changed name in netbox from asw3-d4 to asw2-d4
[15:53:16] it's weird, dig smn.wikipedia.org resolves to this:
[15:53:26] ;smn.wikipedia.org. IN A
[15:53:52] And in python code the host resolves to 134.119.244.166 (random place in Germany)
[16:00:58] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles)
[16:03:09] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) @ema @bblack before I build it, I want to confirm that some complementary information you're looking for is the ability to break down...
[16:08:41] netops, Cognate, Growth-Team, Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[16:19:24] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles)
[16:24:14] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) a:Dmantena
[16:49:16] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (BBlack) >>!
In T264398#6525968, @Gilles wrote: > @ema @bblack before I build it, I want to confirm that some complementary information you're...
[16:49:52] netops, Operations, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (Volans) @jbond FYI as a side note once this is deployed we can probably revisit a bit the firewall rules of the failoid hosts, that were designed already w...
[16:50:25] bblack: if you're around I've a question for you wrt the consolidation of the auto-generated from netbox dns zonefiles
[16:59:41] volans: I can maybe answer asyncishly
[17:00:15] Traffic, Operations, Technical-blog-posts: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (srodlund) @ema Overall, the post is well-written and interesting! I made some minor grammar suggestions. Can you accept / reject them, and I'...
[17:00:46] sure, the tl;dr is: how bad would it be to have identical duplicated records for "some" records for a short amount of time? say 1h at most, the time to migrate all the PoPs to the new scheme.
[17:01:34] if too bad I'll find a better solution ... (actually I just got an idea... I'll try it in a short bit) :)
[17:07:21] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) Thanks so much @Tsevener ! Another thing I wanted to ask, while we'...
[17:28:37] volans: wouldn't we just switch the includes?
[17:29:25] (thus making the ops/dns commit that changes "$INCLUDE little-file-a; $INCLUDE little-file-b; ..."
to "$INCLUDE combined-big-file" the atomic point of switching)
[17:29:42] yes but in a few cases the combined-big-file is already included :/
[17:29:47] ah
[17:30:07] are they all A/AAAA/PTR, or are there now other record types coming from netbox?
[17:30:08] because we have prefixes with IPs and subprefixes
[17:30:27] from netbox only A/AAAA/PTRs are coming and this one is only for PTRs
[17:30:59] so yeah, I don't think the dupes are a big deal in practice, assuming they don't cause authdns-update to fail some strictness check at either the python or gdnsd level
[17:31:01] the consolidation is needed only for reverse, the direct ones are already grouped in the only possible way
[17:31:13] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) @CDanis we can definitely smear the requests - is there a minimum...
[17:31:40] I don't even remember if gdnsd's strict-level warning stuff looks at duplicate PTRs, checking
[17:31:58] thx
[17:33:01] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) @ayounsi Confirmed that you can merge the changes that add BGP communities to pmacct! We'll be monitoring the kafka...
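Note on "smear the requests" (17:31:13): the idea is that clients which currently all fire at the top of the hour each add a random delay, so the burst spreads over a window. A minimal Python sketch of that idea, not the iOS app's actual code. The function name `smeared_delay` and the 15-minute window are assumptions; the log does not state which window was agreed.

```python
import random
from typing import Optional


def smeared_delay(window_seconds: float, rng: Optional[random.Random] = None) -> float:
    """Return a uniformly random delay in [0, window_seconds].

    Hypothetical helper: each client would wait this long after the top of
    the hour before sending its request, so requests spread over the window.
    """
    rng = rng or random.Random()
    return rng.uniform(0.0, window_seconds)


# Assumed window of 15 minutes; a fixed seed only makes this demo repeatable.
demo_rng = random.Random(0)
delays = [smeared_delay(15 * 60, demo_rng) for _ in range(5)]
print([round(d) for d in delays])
```

With a uniform delay the expected request rate over the window is flat, which is the property the traffic side is asking for; other distributions would work as long as they avoid a shared alignment point.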
[17:34:05] CI would fail if they were in the manual repo, but zone_validator doesn't check the generated zonefiles
[17:40:38] bblack: the alternative is to do all 3 PoPs at once using https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/632697, change the generation, pull it on the authdns hosts but skip the deploy, then merge the manual patch on the dns repo and run authdns update
[17:41:01] it looks like the only warning-level check on PTR in gdnsd is the same-zone one, so it should be fine
[17:41:40] same-zone as in 2 records in the same zone or 2 zones with the same name?
[17:42:02] ("the same-zone one" being: it warns (which we upgrade to fail) if a PTR record in, say, 10.in-addr.arpa, has a target that's also in 10.in-addr.arpa, which is usually because you forgot the trailing dot)
[17:42:24] e.g. you have the line in there: "130 1H IN PTR frbast2001.frack.codfw.wmnet"
[17:42:39] ah ok
[17:42:53] without the dot after wmnet, that's actually interpreted as "frbast2001.frack.codfw.wmnet.10.in-addr.arpa."
[17:43:14] other than that sanity-check, it doesn't care what data PTR records have, dupes are fine
[17:44:00] ok great, if we're ok with dupes too I'd like to proceed in that sense
[17:44:07] +1, go for it
[17:44:11] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) @CDanis If the widgets don't have what they need when we request a...
[17:44:40] thanks! most likely tomorrow, I want to be fresh and double check I'm not losing any record in the process :)
[17:44:52] ok :)
[17:44:57] thanks a lot!
[17:55:40] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) >>!
In T264881#6526357, @Tsevener wrote: > @CDanis If the widgets do...
[18:13:52] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) It's not difficult but it does take a couple of days before it'll...
[20:13:33] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I'm happy to work with the Traffic team to increase knowledge on web performance, which metrics matter and why, help you define your...
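Note on the trailing-dot behaviour described at 17:42:02 to 17:42:53: in a zonefile, a name without a trailing dot is relative and gets the zone origin appended, which is how a PTR target ends up inside its own reverse zone. A minimal Python sketch of that rule and of a same-zone check in its spirit. It is not gdnsd's implementation; the function names `qualify` and `target_in_zone` are hypothetical, and special cases such as `@` are ignored.

```python
def qualify(name: str, origin: str) -> str:
    """Return the fully qualified form of a zonefile name.

    Names ending in "." are already absolute; anything else is relative
    and has the zone origin appended. `origin` must end with a dot.
    """
    if name.endswith("."):
        return name
    return f"{name}.{origin}"


def target_in_zone(target: str, origin: str) -> bool:
    """True if the qualified PTR target falls inside the zone itself."""
    fqdn = qualify(target, origin).lower()
    zone = origin.lower()
    return fqdn == zone or fqdn.endswith("." + zone)


origin = "10.in-addr.arpa."
# Missing trailing dot: the target is silently expanded into the reverse zone.
print(qualify("frbast2001.frack.codfw.wmnet", origin))
print(target_in_zone("frbast2001.frack.codfw.wmnet", origin))   # flagged
print(target_in_zone("frbast2001.frack.codfw.wmnet.", origin))  # fine
```

Duplicate PTR lines with correct trailing dots pass a check of this kind unchanged, which matches the conclusion in the log that dupes are fine.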