[00:40:08] <wikibugs>	 10Domains, 10Traffic: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10Quiddity)
[07:28:16] <ema>	 cdanis: hi, yeah /check is handled by the healthchecks plugin. Looks like we should limit it to only deal with varnishcheck requests, good catch!
[09:25:10] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (10ema)
[09:55:53] <wikibugs>	 10Domains, 10Traffic, 10Operations: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10johanricher)
[10:19:42] <wikibugs>	 10netops, 10Analytics, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) @Ottomata first n00b question - I am trying to think about where to add the netflow schema to the secondary repository, and I have some doubts about the dir structure. Should i...
[10:33:45] <jbond42>	 XioNoX: not sure if you saw this https://groups.google.com/a/wikimedia.org/forum/#!topic/ops-maintenance/plWiJE_eMV0   IC-313592
[11:14:08] <XioNoX>	 jbond42: nop, looks like a maintenance turning bad
[11:14:23] <XioNoX>	 but those links were already in the initial maintenance notification
[11:14:28] <XioNoX>	 so it's all good
[11:21:09] <jbond42>	 XioNoX: ack thanks
[12:15:54] <wikibugs>	 10Traffic, 10Operations: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema)
[12:16:02] <wikibugs>	 10Traffic, 10Operations: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) p:05Triage→03Medium
[12:32:01] <cdanis>	 ema: thanks for writing the task! not sure if you saw it but yesterday I also sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/604148
[12:32:18] <cdanis>	 oh, and now I saw your patch
[12:34:09] <ema>	 cdanis: nope, I had missed yours! I think it's better to just stop using healthchecks.so altogether so that we can match on Host
[12:34:32] <cdanis>	 I think you are right, although I want to discuss the details a bit more after I get my coffee
[12:34:43] <ema>	 for sure, enjoy!
[12:35:11] <cdanis>	 I had been looking at this in the context of https://phabricator.wikimedia.org/T251301 so I was thinking about things like a /from/icinga path :)
[12:35:13] <cdanis>	 but, brb
[12:43:19] <cdanis>	 ema: thinking about it a bit more, it seems strange to me to name the endpoints after the intended sender, rather than the receiver that's handling the healthchecks.  if we named them after the recipient, we could do more interesting things like "send a request to ats-tls and make sure the ats-be selected for the query was healthy"
[12:43:44] <cdanis>	 it also feels a bit strange to have to know that /from/pybal means varnishfe, /check means "whichever ATS", etc
[12:51:02] <vgutierrez>	 cdanis: well.. about the varnishfe one.. the Host header gives some kind of a hint.. "varnishcheck.wikimedia.org"
[12:51:19] <vgutierrez>	 we just need to tune ats-be to behave like that IMHO
[12:52:26] <ema>	 cdanis: I agree with you, naming after the recipient rather than the sender seems to make more sense
[12:52:38] <cdanis>	 vgutierrez: yeah, ema wrote https://gerrit.wikimedia.org/r/c/operations/puppet/+/604305 which I think is a good step, I'm just bikeshedding details now :)
[12:53:34] <vgutierrez>	 hmm that CR seems deprecated compared to the varnish-fe equivalent that ema uploaded this mroning
[12:53:36] <vgutierrez>	 *morning
[12:53:56] <vgutierrez>	 we should restrict the host header to "varnishcheck.wikimedia.org"
[12:54:09] <ema>	 vgutierrez: that's https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604364/
[12:54:29] <ema>	 the whole thing is quite confusing :)
[12:54:30] <vgutierrez>	 yeah.. I mean apply that to https://gerrit.wikimedia.org/r/c/operations/puppet/+/604305  as well
[12:54:45] <vgutierrez>	 604305 is too broad compared to 604364
[12:55:09] <ema>	 vgutierrez: why? checks for ats don't use varnishcheck.wikimedia.org, they use varnishcheck :)
[12:55:18] <vgutierrez>	 duh, ok
[12:55:21] <ema>	 I know, the whole thing is superconfusing
[12:55:47] <cdanis>	 perhaps we want to rename that as well? ofc makes rollout tricky but still seems doable
[12:56:03] <ema>	 yeah
[12:56:08] <cdanis>	 healthcheck.wm.o or trafficcheck.wm.o or something
[12:56:09] <vgutierrez>	 that would make sense on a hypothetical scenario where ats-be uses inbound TLS as well
[12:56:48] <ema>	 I think we should always use a fixed Host header like healthcheck.wm.o, and then distinguish the expected *recipient* on uri path 
[12:57:26] <ema>	 plus we should perhaps not have pybal check varnish-fe via ats-tls? 
[12:57:40] <cdanis>	 +1 to all of that :)
[12:58:17] <vgutierrez>	 ema: well.. ats-tls is useless without varnish-de
[12:58:18] <vgutierrez>	 *fe
[12:58:44] <ema>	 not when it will listen on port 80!
[12:58:45] <vgutierrez>	 I think it's useful to be able to depool based on that combination (varnish-fe being reachable via ats-tls)
[12:59:05] <vgutierrez>	 right.. that would trigger a change on port 80 healthcheck
[12:59:17] <vgutierrez>	 but port 443 would remain the same till we get rid of varnish-fe
[13:00:19] <cdanis>	 maybe pybal should stay checking varnishfe health for now, but I'd still like to see it probing healthcheck.wm.o/varnishfe, instead of needing to know that /from/pybal is handled by varnishfe
[13:00:39] <cdanis>	 if we think there's diagnostic value in knowing the request originator at a glance, we could always add ?from=pybal
[13:01:02] <ema>	 or, wait for it, set User-Agent!
[13:01:20] <cdanis>	 ahah
[13:01:29] <cdanis>	 that's why I said 'at a glance' ;) but yeah
[13:12:04] <ema>	 anyways, I'd like to do this in steps: first actually check for Host/uri at both the ats-be and varnish-fe layer -- https://gerrit.wikimedia.org/r/#/q/topic:T255015+(status:open+OR+status:merged) 
[13:12:17] <stashbot>	 T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015
[13:12:22] <ema>	 and then changing the URIs
[13:12:33] <ema>	 thanks stashbot we know
[13:14:02] <cdanis>	 +1, we'll probably have to support both URIs for a short while as well
[13:31:00] <cdanis>	 vgutierrez: LOL I sent some /from/cdanis requests yesterday, without knowing of yours
[13:31:09] <vgutierrez>	 :)
[13:34:46] <ema>	 I usually add some private message to the gods, like /come/on/do/it or so
[13:36:04] <ema>	 merging with puppet disabled on A:cp because you never know
[13:36:22] <cdanis>	 indeed you never know
[13:36:25] <cdanis>	 ema: mind if I edit the task description with a summary of our discussion?
[13:36:39] <ema>	 cdanis: of course not
[13:37:02] <vgutierrez>	 "you know nothing" - ancient puppet survival guide
[13:38:34] <ema>	 cdanis: ah, and the "X-Cache: bug" thing is also something to fix: we set X-Cache to the cache lookup result, if none we say "bug"
[13:38:43] <ema>	 healthchecks don't lookup the cache, there you go
[13:38:46] <cdanis>	 will note it as well
[13:39:04] <ema>	 we should probably say "int", in line with what varnish does
[13:39:09] <ema>	 (for internal)
[13:42:08] <ema>	 alright, on cp3050 requests for localhost:3128/check with Host != "varnishcheck" now reach the applayer as expected
[13:42:50] <ema>	 varnish-fe still considers cp3050 as healthy, also good
[13:43:59] <vgutierrez>	 good
[13:45:20] <cdanis>	 nice
[13:46:05] <ema>	 the ats-instance-drilldown dashboard looks like nothing obvious exploded
[13:47:00] <ema>	 re-enabling puppet, will begin a rolling ats-be restart in a few minutes
[13:47:20] <vgutierrez>	 we should restart ats-tls as well
[13:47:30] <vgutierrez>	 to get rid of the healthcheck plugin there as well
[13:47:37] <wikibugs>	 10Traffic, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10Andrew)
[13:48:42] <vgutierrez>	 right now https://en.wikipedia.org/check is being blocked by ats-tls, not ats-be
[13:49:57] <ema>	 indeed
[13:50:21] <ema>	 vgutierrez: ok to ats-tls-restart on cp3050?
[13:50:27] <vgutierrez>	 go ahead :D
[13:50:38] <vgutierrez>	 I love how you do tests on a 8k rps instance ;P
[13:51:12] <ema>	 if it works there, it works everywhere :)
[13:51:33] <vgutierrez>	 indeed
[13:53:58] <ema>	 vgutierrez: looks fine
[13:54:13] <vgutierrez>	 <3 perfect
[13:54:25] <ema>	 curl -v -k -H "Host: en.wikipedia.org" https://localhost/check reaches the applayer if you try enough time to c-hashed to cp3050 as a backend :)
[13:54:32] <ema>	 s/time/times/
[13:58:57] <wikibugs>	 10netops, 10Analytics, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Yeah we don't have a great convention for namepacing.  For analytics/instrumentation schemas, we decided to keep things simple and keep the hierarchy mostly flat, e.g. analyt...
[14:07:21] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10CDanis)
[14:07:22] <ema>	 vgutierrez: ok to proceed with rolling ats-tls/ats-be restarts?
[14:09:40] <vgutierrez>	 +1
[14:13:03] <wikibugs>	 10netops, 10Analytics, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) @ayounsi any suggestion? `netflow/flow/something` ?
[14:13:31] <vgutierrez>	 ema: ATS is segfaulting on cp3056
[14:13:43] <ema>	 uh
[14:14:09] <vgutierrez>	 Jun 10 14:11:35 cp3056 traffic_server[12825]: Fatal: couldn't allocate 110592 bytes at alignment 4096 - insufficient memory
[14:14:20] <vgutierrez>	 lua suffering under varnish-fe stress?
[14:14:40] <cdanis>	 uhh
[14:15:01] <ema>	 interrupting the restarts
[14:16:51] <ema>	 so both cp3052 and cp3056 seem to have issues
[14:17:21] <vgutierrez>	 3062 as well
[14:17:40] <ema>	 3056 is up now
[14:17:58] <vgutierrez>	 you sure?
[14:18:03] <vgutierrez>	 journalctl shows a stacktrace to me
[14:18:05] <ema>	 sorry, 3052 I meant
[14:19:20] <vgutierrez>	 so ATS is crashing upon reload
[14:19:32] <vgutierrez>	 before restarting it, right?
[14:20:06] <vgutierrez>	 so maybe the issue is related to unloading healthchecks.so?
[14:20:50] <ema>	 not upon reload though, after a significant time
[14:20:55] <ema>	 Jun 10 13:55:20 cp3052 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server..
[14:21:01] <ema>	 Jun 10 14:11:11 cp3052 traffic_manager[23642]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory
[14:21:13] <vgutierrez>	 hmm right
[14:21:22] <vgutierrez>	 but you were missing a restart in the middle of that
[14:21:22] <vgutierrez>	 right?
[14:21:38] <vgutierrez>	 I mean.. run-puppet-agent && ats-backend-restart would show a different log
[14:21:45] <ema>	 I'm gonna restart the trafficservers that are in trouble now
[14:23:32] <ema>	 yeah, 3052 for instance wasn't restarted till 14:15:22
[14:24:10] <vgutierrez>	 cp1083 is crying now, I'd trigger a restart of the already updated instances ASAP
[14:24:25] <ema>	 well they've all been updated
[14:24:50] <vgutierrez>	 so you've triggered a puppet run everywhere and you were going to restart them afterwards?
[14:25:05] <ema>	 triggered a puppet run a very long time ago, no problem
[14:25:15] <ema>	 retarted ats-be on cp3050, no problem
[14:25:27] <vgutierrez>	 gotcha
[14:25:28] <ema>	 started rolling restarts, problems (on other instances, not those being restarted)
[14:25:57] <vgutierrez>	 hmmm depooling/repooling ats-be instances triggers an increase of healthchecks from varnish-fe to ats-be, right?
[14:26:05] <ema>	 definitely, yes
[14:26:33] <ema>	 I wonder if we should instead restart ats-be without depool
[14:27:33] <ema>	 will do so for the instances now red in icinga
[14:27:49] <vgutierrez>	 ack
[14:30:27] <ema>	 the restarts take a while but seem to clear things
[14:30:52] <vgutierrez>	 I'm wondering why traffic_manager fails to respawn traffic_server BTW
[14:33:46] <ema>	 I see recoveries
[14:33:55] <vgutierrez>	 yup
[14:38:24] <ema>	 any simple way to say "restart the sercice if you haven't already today"? :)
[14:40:53] <ema>	 vgutierrez: or maybe just another unconditional rolling restart without depool, what do you think?
[14:41:25] <vgutierrez>	 hmmm
[14:41:33] <ema>	 the state currently is: I have stopped the first rolling restart when I saw the first crashes, but now it's clear that crashing services are those that haven't been restarted
[14:41:48] <ema>	 so we need a restart ASAP, but we're afraid that depooling might cause issues
[14:42:03] <cdanis>	 ema: [[ $(date +%s -d'12 hours ago') > $(date -d"$(systemctl -p ExecMainStartTimestamp show FOOBAR.service  | cut -f2 -d=)" +%s) ]] || systemctl restart ...
[14:42:04] <ema>	 could also be that the depool/healthcheck increases is unrelated to the crashes really
[14:42:41] <vgutierrez>	 I hope it's just a one-off thing, so we need to restart them once and that's it
[14:42:43] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey)
[14:43:22] <vgutierrez>	 ema: I'd restart them without depooling.. is gonna be way faster
[14:43:27] <ema>	 vgutierrez: ok
[14:43:35] <vgutierrez>	 and we've a big chunk of them that are going to crash
[14:44:38] <ema>	 rolling systemctl restart began
[14:46:35] <vgutierrez>	 so far 25/72 instances crashed
[14:47:44] <ema>	 vgutierrez: but nothing crashed after restart, right?
[14:47:45] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Ya if possible, the schema should be named and modeled after what the event represents.  In this case it sounds like it is something like a 'network co...
[14:47:56] <vgutierrez>	 ema: nothing crashed twice
[14:48:31] <ema>	 right
[14:53:35] <ema>	 ok, all backend restarts done
[14:54:03] <vgutierrez>	 2009 and 5004 on icinga is just icinga being slow?
[14:54:17] <vgutierrez>	 *2029
[14:54:26] <vgutierrez>	 2029 just recovered
[14:54:31] <ema>	 yeah 
[14:54:58] <ema>	 cp5004 should recover shortly too
[14:56:01] <ema>	 alright, that wasn't so smooth!
[14:56:45] <vgutierrez>	 well.. cp5004 is actually depooled ema
[14:56:50] <vgutierrez>	 {"cp5004.eqsin.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqsin,cluster=cache_upload,service=ats-be"}
[14:57:07] <vgutierrez>	 I guess aborting the restart triggered that
[14:57:15] <ema>	 y
[14:57:20] <ema>	 repooling
[14:57:30] <vgutierrez>	 this was a bumpy ride indeed
[14:58:13] <ema>	 vgutierrez: and I haven't done ats-tls yet
[14:58:23] <vgutierrez>	 hmmm
[14:58:45] <vgutierrez>	 maybe you wanna go first with ulsfo and codfw ;)
[14:59:05] <vgutierrez>	 but it's a quite different scenario
[14:59:10] <vgutierrez>	 healthchecks.so wasn't being used there
[14:59:23] <ema>	 sure, we can start with ulsfo
[14:59:25] <cdanis>	 wasn't it still loaded there?
[14:59:29] <vgutierrez>	 yes
[15:00:43] <ema>	 also ats-tls-restart doesn't trigger any vcl reloads so this really should be fine
[15:01:23] <vgutierrez>	 hmmm https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?panelId=75&fullscreen&orgId=1&from=now-3h&to=now&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend
[15:01:33] <vgutierrez>	 it looks like lua's been stressed a little bit :)
[15:03:23] <vgutierrez>	 ema: also... https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=now-24h&to=now&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&fullscreen&panelId=73
[15:03:45] <vgutierrez>	 ema: our healthcheck lua implementation triggers some memory allocations on the global lua plugin
[15:04:02] <ema>	 ah so we do get global stats too!
[15:04:42] <ema>	 vgutierrez: I'm not sure the Lua VM threads graph has anything to do with this, those are remap plugins right?
[15:05:00] <vgutierrez>	 click on the global one please
[15:05:45] <ema>	 I've got it open :)
[15:05:59] <vgutierrez>	 so the graph shows global and remap threads
[15:06:09] <vgutierrez>	 if you don't hide the remap threads.. the global one looks like 0
[15:06:26] <vgutierrez>	 but it's definitely a spike there for global after the healthcheck switch
[15:06:58] <ema>	 ah I see what you mean, yep!
[15:18:14] <ema>	 vgutierrez: meanwhile ats-tls restarts in ulsfo are proceeding uneventuflly
[15:18:20] <vgutierrez>	 lovely
[15:18:27] <vgutierrez>	 for once I get the easy path
[15:18:29] <vgutierrez>	 <3
[15:20:19] <vgutierrez>	 BTW and probably related https://github.com/apache/trafficserver/issues/4562#issuecomment-642065330
[15:23:09] <ema>	 vgutierrez: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?panelId=81&fullscreen&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3055
[15:23:43] <ema>	 from 7G to 50+
[15:23:46] <vgutierrez>	 that's a teeny tiny spike
[15:24:22] <vgutierrez>	 the only instance that's missing the spike is cp3050
[15:24:33] <ema>	 which is the one I restarted almost immediately
[15:24:55] <vgutierrez>	 so I think it's better to link the puppet run and the restart :)
[15:25:05] <ema>	 sure
[15:25:05] <vgutierrez>	 that's what I usually do
[15:25:17] <ema>	 even better not to have memory leaks on config reload
[15:25:22] <ema>	 but yes :)
[15:25:24] <cdanis>	 please ema let's be reasonable
[15:25:44] <vgutierrez>	 ema can be picky sometimes
[15:25:49] <vgutierrez>	 ;P
[15:26:21] <ema>	 and yet I know that low expectations is the key to happiness
[15:26:39] <vgutierrez>	 BTW.. that's pretty interesting cause ats-tls is being hit by traffic_ctl reload at least 2 times a day
[15:26:54] <cdanis>	 it's possible it's just some configurations that trigger the leak?
[15:27:19] <ema>	 the big difference between ats-tls and ats-be is that the latter's healthchecks.so gets thousands of reqs per second
[15:27:21] <vgutierrez>	 well.. we also know for a fact that tslua doesn't like to be reloaded
[15:27:22] <ema>	 the former 0
[15:28:08] <ema>	 and what vgutierrez says
[15:28:18] <cdanis>	 mmm
[15:30:05] <vgutierrez>	 ema: can we trigger a few ats-be depool+restarts?
[15:30:12] <vgutierrez>	 and see how they behave?
[15:30:23] <ema>	 wait, first: ok to finish ats-tls restarts?
[15:30:26] <ema>	 I've done ulsfo only so far
[15:32:33] <vgutierrez>	 yep
[15:32:36] <vgutierrez>	 go ahead please
[15:33:08] <ema>	 started
[15:33:43] <ema>	 vgutierrez: you mean you want to try some ats-backend-restarts and see what happens to the trillion of healthchecks? :)
[15:33:52] <vgutierrez>	 indeed
[15:34:04] <vgutierrez>	 if ats-be with tslua based healthchecks can't handle that
[15:34:09] <vgutierrez>	 we need to revert the change
[15:34:58] <ema>	 sounds good but let's wait for tomorrow
[15:35:11] <ema>	 serviceops is now merging a thing that will result in 2x purges being sent for a bit
[15:36:08] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) `netflow/flow/record` or `netflow/flow/observe` could be ok?
[15:36:35] <vgutierrez>	 ema: ack
[15:51:41] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) I'd guess I'd still ask what is a "netflow"?  Or a "netflow/flow"?  I guess if you could defined 'netflow' as a noun in the description of the schema,...
[15:58:31] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) observe could be ok too, sorry didn't mean to make observation sound better or worse.  netflow/observe event sounds a little weird but does seem consis...
[15:59:17] <cdanis>	 btw ema when you get a minute please make sure my edits to T255015 captured everything
[15:59:18] <stashbot>	 T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015
[15:59:21] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) ALSO these are just ideas and thoughts!  Schemas in secondary repo SHOULD require less bikeshedding than those in primary :)
[16:01:14] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) Netflow in theory is the name of the technology/protocol (see https://tools.ietf.org/html/rfc3954), and IIUC it defines a "flow" as the bytes/packets exc...
[16:01:23] <ema>	 cdanis: will do, thanks!
[16:12:09] <wikibugs>	 10Traffic, 10Operations, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle)
[16:13:00] <wikibugs>	 10Traffic, 10Operations, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) I've updated the task descriptoin to include: * the grandfathering of action=rollback (as agreed a few years ago)...
[16:13:42] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Ah, makes more sense!  Great.  If pmacct is aggregating, perhaps summary is good in the name?
[16:38:06] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema)
[16:39:21] <ema>	 cdanis: lg :)
[16:41:49] <wikibugs>	 10Traffic, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.37; 2020-06-16): Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Pchelolo) Big wikis are now done, but there's still a bit of long tail work left - Convert beta cluster to...
[16:44:45] <cdanis>	 thanks ema!
[17:05:08] <wikibugs>	 10Acme-chief, 10Patch-For-Review, 10cloud-services-team (Kanban): tools/toolsbeta: improve acme-chief integration - https://phabricator.wikimedia.org/T252762 (10Andrew) acme-chief is set up and working in toolsbeta now.  I haven't actually consumed any of the certs or  thought about what certs we need (right...
[17:07:39] <wikibugs>	 10netops, 10Operations: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10ayounsi) 05Open→03Resolved All done!
[17:21:26] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Nuria) analytics/network/netflow/flowset?  analytics/netflow/flowset?   Flowset is how an "event" is called on the protocol linked by @elukey above
[17:24:46] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) I don't think this needs to go in 'analytics', but flowset sounds nice if it is accurate.
[17:27:10] <wikibugs>	 10netops, 10Analytics, 10Analytics-Cluster, 10Operations: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Nuria)  network/netflow/flowset?
[17:43:16] <ottomata>	 Q
[17:43:40] <ottomata>	 why do LVS service ports need to vary if their domain names already vary
[17:43:41] <ottomata>	 ?
[17:43:51] <ottomata>	 i gues LVS doesn't do name based routing?  
[17:44:33] <ottomata>	 it'd be really handy to e.g. address eventgate-main just with either eventgate-main.discovery.wmnet or eventgate-main.svc.eqiad.wmnet
[17:49:08] <cdanis>	 ottomata: LVS does not, LVS doesn't get to look at that at all AIUI
[17:50:29] <ottomata>	 ii guess just because its lower level, no knowledge of domains?
[18:06:19] <cdanis>	 LVS doesn't get to inspect payloads at all; I believe all it sees is the src/dst * IPs/ports
[18:09:52] <ottomata>	 aye too bad
[18:10:11] <ottomata>	 what is doing the frontend translation?  ats?
[18:10:26] <ottomata>	 e.g. for stream.wm.org -> stream.discovery.wmnet:8092
[18:31:44] <volans>	 ottomata: yes, see hieradata/common/profile/trafficserver/backend.yaml
[18:31:49] <volans>	 target: http://stream.wikimedia.org
[18:31:50] <volans>	 replacement: https://eventstreams.discovery.wmnet:4892
[18:32:03] <volans>	 LVS is layer 4 only
[18:32:08] <volans>	 (on TCP)
[18:33:29] <volans>	 then discovery does its "magic"  (see https://wikitech.wikimedia.org/wiki/DNS/Discovery)
[18:35:13] <volans>	 that will return an *.svc.* address, where there is an LVS that will route the traffic to one of the backends for that service
[19:01:09] <ottomata>	 right.
[19:01:19] <ottomata>	 can we make ats endpoints for internal things only?
[19:01:26] <ottomata>	 to avoid having to hardcode ports everywhere?
[19:01:30] <ottomata>	 volans:  ^?
[19:07:17] <volans>	 don't ask me :)
[19:07:44] <cdanis>	 ottomata: I think the thing you actually want is Envoy with its config derived from the service catalog, but ask _joe_ 
[19:07:45] <volans>	 but I'd say that doesn't make too much sense
[19:08:09] <volans>	 (reply was to the ATS above, not cdanis's one)
[19:08:22] <_joe_>	 what's the Y here?
[19:08:30] <_joe_>	 the Y problem I mean
[19:09:08] <volans>	 connect to foo.discovery.wmnet without having to hardcode the port in the client apparently :)
[19:09:24] <_joe_>	 no no
[19:09:27] <volans>	 basically connect to foo.w.o but internally
[19:09:36] <_joe_>	 I mean, what is that otto really wanted to do
[19:09:50] <_joe_>	 not whatever abstraction was formulated here
[19:10:03] * volans no idea
[19:10:14] <_joe_>	 yeah I'm asking ottomata :P
[19:28:30] <ottomata>	 yeah
[19:28:50] <ottomata>	 right now we're tryingi to emit canary events into all streams
[19:28:54] <ottomata>	 (or, msot of them)
[19:29:00] <ottomata>	 so we can do better monitoring
[19:29:08] <ottomata>	 to do that, we need to know where to POST the events
[19:29:28] <ottomata>	 and we need to post them into each datacenter eventgate instance too
[19:29:41] <ottomata>	 right now there are 4 eg services in each datacenter
[19:29:46] <_joe_>	 ook where you post them from?
[19:29:56] <ottomata>	 unknown yet, but likely from in analytics cluster
[19:30:50] <ottomata>	 so we get the list of streams from stream config, get their event service name (as configured in mediawiki config), and tthen need to map from that to the final service urls to post the event to
[19:30:58] <ottomata>	 so basically i'm starting with 
[19:30:59] <ottomata>	 eventgate-main
[19:31:06] <ottomata>	 and I need to know that maps to
[19:31:18] <_joe_>	 you have all that in service::catalog in puppet :)
[19:31:24] <ottomata>	 eventgate-main.svc.eqiad.wmnet:4492 AND eventgate-main.svc.codfw.wmnet:4492
[19:31:29] <_joe_>	 actually, we should have a function to get a single service url
[19:31:35] <ottomata>	 oh ya?
[19:31:45] <_joe_>	 no I mean we need to write it :D
[19:31:59] <_joe_>	 wmflib::service::get_url($service, $site)
[19:32:09] <_joe_>	 so you can use that in puppet
[19:32:27] <_joe_>	 other than that, yes envoy will standardize stuff but it's probably not the best choice for this
[19:32:47] <ottomata>	 ooo
[19:32:48] <ottomata>	 looking
[19:33:16] <_joe_>	 ottomata: wmflib::service::fetch() gets all the data in service::catalog into a puppet data structure for you
[19:33:34] <_joe_>	 you can then lookup the service you want and build the url from the data in the structure
[19:33:52] <_joe_>	 so if $service['encryption'] => 'https'
[19:33:54] <_joe_>	 and so on
[19:33:57] <ottomata>	 not seeting get_url
[19:34:08] <_joe_>	 21:31:44	<_joe_>	no I mean we need to write it :D
[19:34:12] <ottomata>	 oh haha
[19:34:14] <ottomata>	 oh sorry
[19:34:15] <ottomata>	 :)
[19:34:16] <ottomata>	 ok
[19:34:20] <_joe_>	 :P
[19:35:42] <ottomata>	 coool
[19:35:48] <ottomata>	 ok nice