[10:50:58] netops, Operations: Fix LibreNMS alert "CDR bills over 75% used" - https://phabricator.wikimedia.org/T247949 (ayounsi) Open→Resolved Done, with a threshold moved to 90%. And runbook updated. The only limitation is that it will alert for every devices that are part of that traffic bill, which is a...
[12:12:00] netops, Operations, Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (ayounsi) LGTM! I forgot one step: write doc :) - There should be a wiki page (eg on https://wikitech.wikimedia.org/wiki/Storm_control) that explains what it is, where/how it's de...
[12:57:22] Traffic, Analytics, Operations, Research, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Miriam) Thanks @elukey for this summary. There are two macro use-cases for the release/simplified access to article pageview...
[15:08:15] Traffic, Operations, Projects-Cleanup, Release-Engineering-Team-TODO, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (WDoranWMF)
[15:17:36] Traffic, Operations, Privacy Engineering, Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (JFishback_WMF)
[15:23:16] hiya ema! are there any timeouts in frontend http/https termination I'm not aware of?
[15:23:23] re https://phabricator.wikimedia.org/T242767
[15:23:36] it seems connections to stream.wm.org are disconnected after exactly 15 minutes
[15:23:51] but they stay alive using the internal discovery url.
[15:26:39] ottomata: hi!
[15:26:47] ottomata: that might be proxy.config.http.transaction_active_timeout_in
[15:26:54] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-http-transaction-active-timeout-in
[15:27:24] can it be lua'd on and off for a conn, based on us knowing it's a long-lived stream.wm.o conn?
[15:27:35] (or value-change maybe rather than on/off)
[15:28:45] the docs say it's both reloadable and overridable, so we should be able to unset it for stream.wm.o only yeah
[15:30:21] ema: ah ha! cool. shall I try to make a patch (is this in puppet?) or can you fix?
[15:32:39] ottomata: this specific setting isn't puppetized yet, I'll fix!
[15:34:55] netops, Operations: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis)
[15:57:34] ema: thank you!
[15:58:13] Did we manage to beat traffic records yesterday?
[16:00:38] See jynus in https://mobile.twitter.com/jynus/status/1241994650908532737
[16:09:18] netops, Operations: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (ayounsi) Discussed it with Chris on IRC, LGTM.
[16:25:24] Traffic, Analytics, Analytics-Kanban, EventStreams, Operations: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (Ottomata) a:Ottomata→ema
[16:54:38] RhinosF1: https://twitter.com/jynus/status/1242494884210266113
[16:56:49] jynus: cool
[17:00:39] bblack: if you're around, for the actual run of deploy-check on the authdns hosts, do you prefer all at once or batched?
[17:00:52] the one triggered by the netbox cookbook ofc
[17:54:35] Is it ok for me to answer to INCIBE-CERT that they got a false positive again? Or probably vgutierrez should do it like last time?
[17:55:14] hmmm go ahead
[18:59:40] volans: all at once is probably best
[19:00:03] bblack: ack, that's what the patch I've sent would do if not modified
[19:00:06] volans: (although it wouldn't be an awful idea to try just one, the very first time, manually!)
[19:00:38] sure, we can do as much deploy tests as we want :)
[19:01:12] I hope I got the command to execute right from our last chat
[19:01:19] it looks right
[19:01:22] and looking at the local deploy script
[19:01:28] *local update
[19:01:49] what's the worst that could happen anyways? :)
[19:01:49] and that's executed by root AFAICT correct?
[19:01:53] eheheh
[19:01:54] yes
[19:01:59] good question :D
[19:01:59] well
[19:02:17] yes, root
[19:02:41] I was going to say, there is a less-privileged account used to ssh around for authdns-update, but I think it sudos back before executing the real code
[19:03:34] the files in /etc/gdnsd are owned by root at the moment
[19:04:23] yeah
[19:06:16] as for next steps (apart a quick manual test of that command) what we want to do and when would be a good time for it?
[19:06:35] I was thinking to start with one PoP mgmt ifaces, like ulsfo
[19:07:09] so, the netbox repo has basically all the mgmt/network stuff, but not the real host-level stuff
[19:07:26] and the initial checkout will have all of those mgmt/network snippets deployed (to /etc/gdnsd/...)
[19:07:40] but we don't have any of them $INCLUDE-d from the ops/dns zonefiles yet, right?
[19:08:04] so when you say start with ulsfo mgmt, you mean as a test for the first live $INCLUDE replacing the existing manual records with an ops/dns change?
[19:08:16] yes (pretty much to all)
[19:08:36] ok
[19:08:41] sounds like a good plan!
[19:09:01] the bit that is slightly incorrect is that /etc/gdnsd/zones has already a netbox subdir
[19:09:11] with snippets that are not included
[19:09:23] oh?
[19:09:50] when we merged last week the deploy-check.py changes you reviewed too
[19:10:02] it has a git checkout I think?
[19:10:20] the git checkout is in /srv/git/netbox_dns_snippets/
[19:10:20] the change I just reviewed turns on running deploy-check, which puts things in /etc/gdnsd/
[19:10:43] I thought
[19:10:59] Traffic, Operations, ops-codfw: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (Papaul)
[19:11:39] https://gerrit.wikimedia.org/r/c/operations/dns/+/569340
[19:11:42] this one ^^^
[19:12:22] oh right, you mean authdns-update has already been putting it there
[19:12:31] yes
[19:12:35] I was forgetting that bit, that makes sense
[19:12:59] so really, the first run of the sre.dns.netbox cookbook should be basically a no-op
[19:13:27] apart maybe updating some of the netbox zone snippets if anything has changed in netbox in the meanwhile
[19:13:30] but yes
[19:13:40] for the generated zones that gdnsd loads
[19:14:31] btw, does gdnsd load only files in zones/* without recursing
[19:14:32] ?
[19:14:40] correct
[19:14:53] it only looks at regular files that don't start with a dot
[19:15:02] great, so anything under netbox/ will never be even looked
[19:15:07] leaving subdirs to be used for things like templates and includes
[19:15:11] k
[19:30:26] technically, we can run authdns-update with no pending merges to pull in new netbox stuff, too
[19:30:52] but the cookbook is the better path towards eventual automation, so it doesn't confusingly pull in ops/dns changes and/or have a manual prompt about them
[19:31:35] yeah I'd start separate to avoid any side effects, we're sure that the netbox one doesn't pull new dns changes
[19:31:51] the only case we have to solve is when we'll need both at the same time
[19:32:18] i.e. we replace the ulsfo mgmt origin with an include but have to leave one record manually crafted because $reasons
[19:32:31] at some point we manage to be able to manage that one record too from netbox
[19:33:08] and given that the include is already in place to avoid duplicate records we need to deploy at the same time the removal of the manual one and the update of the netbox one
[19:33:34] I don't expect such cases now at the start but something to keep in mind
[19:33:41] yeah
[19:34:16] the "easy" answer for now is flip a switch to turn off the auto-pushes from netbox (which don't exist yet anyways, I think), change netbox and ops/dns, and run authdns-update manually to pull both changes together.
[19:35:03] the sooner we get rid of the exceptions the better heh
[19:35:49] the other easy paths are just to let one side temporarily fail an update
[19:36:01] either let netbox complain of a failure to deploy then run authdns-update to fix
[19:36:35] or run authdns-update with some kind of force-even-though-reload-failed flag, and then do the netbox change which will apply successfully
[19:36:39] would it fail?
[19:36:56] it would succeed in pushing the data, but the reload command would return bad status
[19:37:07] for 2 duplicated identical records
[19:37:12] yes
[19:37:16] ok
[19:37:19] well, depends on the record I guess
[19:37:24] now that I think about it, maybe it wouldn't
[19:37:29] lol
[19:37:46] sorry, dinner's ready, have to step out, will read in a bit
[19:37:54] if it's just A/PTR, it might let it fly. PTR might warn, and we might have a strictness flag that turns warns into fails
[19:37:58] ok, ttyl
[19:39:27] ok, we do have zones_strict_data = true
[19:39:35] but yeah, ttyl, sorry
[20:15:28] * volans back
[21:00:52] netops, Operations, Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis)
[21:19:44] netops, Operations: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis) Deployed on cr2-eqsin: `cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0 Status information FPC Slot: 0 IPV4 ex...
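Editor's aside on the zone-loading rule described at 19:14:53 (only regular files directly in zones/ whose names don't start with a dot, with no recursion into subdirs such as netbox/). The following is a minimal Python sketch of that selection rule as stated in the chat, not gdnsd's actual code; the function name `candidate_zone_files` is hypothetical.

```python
from pathlib import Path


def candidate_zone_files(zones_dir):
    """Return the names of top-level entries that the described rule would pick up.

    Assumption: this mirrors only what was said in the chat (regular files,
    no leading dot, no recursion). Note that Path.is_file() follows symlinks,
    which may or may not match the real daemon's behaviour.
    """
    return sorted(
        entry.name
        for entry in Path(zones_dir).iterdir()  # top level only, never recurses
        if entry.is_file()                       # skips subdirs such as netbox/
        and not entry.name.startswith(".")       # skips dotfiles
    )
```

Under this rule, snippets placed in a subdirectory are never loaded on their own; they only take effect once a top-level zonefile pulls them in with $INCLUDE, which matches the plan discussed above.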