[05:52:07] Analytics, User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (elukey) @razzi @Ottomata I am ok with the procedure, nice work! For stats.wikimedia.org we can do as I suggested, namely having envoy on 8443 and change settings for ATS. It is su...
[05:53:49] good morning :)
[07:01:14] Good morning!
[07:01:58] o/
[07:02:13] I found another occurrence of hue graph acting weirdly
[07:02:19] :s
[07:03:05] so the cassandra workflows have bits like
[07:03:23]
[07:03:23]
[07:03:23] cassandra.output.keyspace
[07:03:23] "${cassandra_keyspace}"
[07:03:24]
[07:03:36] now the " in the value break when translating to json
[07:03:43] this is why it fails to load
[07:04:23] right - But we need those, as we want the value explicitly quoted :()
[07:05:14] oh yes I am wondering is single quotes would be enough
[07:05:32] otherwise I can check in Hue's code if there is a way to apply a workaround
[07:05:54] *if single
[07:05:54] elukey: I think CQL expects double quotes
[07:07:14] never a joy :
[07:07:15] :D
[07:07:25] :S
[07:11:07] I have an idea though, testing something
[07:11:14] sure elukey
[07:12:27] elukey: Checking the cassandra graph after tonight's failure - It is weird that the number of tasks piles up in only 2 hosts :(
[07:15:20] yeah indeed, do we round robin?
[07:16:05] elukey: normally the job sends the row to its designated owner using table key
[07:18:31] joal: is the owner one of the three replicas? I mean, always the same?
[07:20:33] elukey: we use cassandra algo to define which host is the owner of any row based on its key
[07:20:41] And then cassandra handles the replication
[07:23:08] joal: yep but IIRC you can send a write to either of the replicas for the same key right?
What I mean is hammering a single instance vs spreading the write load to multiple replicas evenly
[07:23:12] (not sure if possible)
[07:23:40] there is also the possibility of increasing the write concurrency, it is 32 now, but docs suggest something like 8x#cores
[07:27:48] elukey: I'm not sure if we can spread the write load
[07:28:02] Increasing write concurrency seems relatively easy
[07:28:42] yep I agree
[07:32:50] I still wonder how we end up saturating 2 hosts but not the others :(
[07:35:31] it may depend on how data is distributed
[07:38:11] it does for sure elukey, but still, there are 2/3 orders of magnitude difference in terms of pending tasks between busy-not busy nodes - it feels wrong
[07:41:20] I know I know
[07:41:26]
[07:49:41] * elukey bbiab
[07:53:34] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Amire80) One thing that seems to be missed here is that it's not really that important how many articles in the Slovak la...
[08:06:19] I am going to reboot stat1005/stat1008 as planned
[08:33:02] ao a tutti!
[08:33:15] ... where did half my "Ciao" go?
[08:38:16] :D
[08:38:20] Hello klausman :)
[08:39:20] One needs to remember to kinit after host reboot :)
[08:40:15] elukey: I'm ready to decom some machinery
[08:42:18] * joal is using a lot of resources on the cluster - monitoring along as well, but please let me know of any concern
[08:50:27] klausman: sure! Do you want to file a code change for it or should I do it, and then you execute the commands?
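Editor's note on the Cassandra discussion above (owner chosen from the row key, pending tasks piling up on 2 hosts): the following is a minimal sketch of key-based ownership, not the loader job's actual code. It assumes a hash ring with virtual nodes and uses MD5 as a stand-in hash (Cassandra's default partitioner is Murmur3). The `Ring` class, `write_load` helper and node names are hypothetical. It only illustrates that when many rows share one partition key, all of their writes go to the same owner, which is one possible way a few hosts could end up far busier than the rest.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Map a string to a 64-bit token (MD5 is a stand-in, not Cassandra's hash)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical hash ring: each node owns several token ranges (vnodes)."""

    def __init__(self, nodes, vnodes=64):
        # Every node gets `vnodes` positions on the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # The owner is the first ring position after the key's token,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._tokens, token(partition_key))
        return self._ring[idx % len(self._ring)][1]


def write_load(ring: Ring, partition_keys) -> Counter:
    """Count how many writes each node would own for the given keys."""
    return Counter(ring.owner(key) for key in partition_keys)


if __name__ == "__main__":
    ring = Ring([f"node-{n}" for n in "abcdef"])  # hypothetical host names
    # Many distinct keys spread across nodes.
    print(write_load(ring, [f"key-{i}" for i in range(6000)]))
    # One hot partition key sends every write to a single owner.
    print(write_load(ring, ["hot-key"] * 6000))
```

Replication is left out on purpose: as said in the chat, Cassandra handles it after the owner is chosen.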
[08:52:12] You file, I command :)
[08:54:18] ack
[08:54:40] so let's check our dear https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1
[08:54:59] under replicated blocks are 0, so we are good
[08:56:07] yaaay
[08:57:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/634474
[08:58:18] ah no wait
[08:58:29] I am missing a file :D
[08:59:09] sorry CR updated klausman
[09:00:15] gerrit is very slow for me today
[09:01:14] same thing
[09:01:38] ok, all reviewed
[09:02:29] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning
[09:02:42] it is not perfectly up to date, but it contains all
[09:02:48] if you want we can use cumin!
[09:03:48] The editing of hosts.exclude is taken care of by puppet, right?
[09:04:08] exactly, this is the outdated part
[09:05:00] we can pair on meet if you want
[09:05:12] Yeah let's do that
[09:05:15] (if so gimme 5 mins for a quick coffee)
[09:05:24] Sure, caffeine has priority
[09:05:34] danke :)
[09:12:38] ok done!
[09:31:00] Analytics-Clusters, Operations, Traffic, Patch-For-Review: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) Open→Resolved All varnishkafka instances restarted with 6.0.6-1wm2, CPU usage [[https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPan...
[10:39:00] * elukey afk! lunch!
[12:48:14] wow turnilo released the new version with Dan's commit!
[12:48:55] He famous now
[12:50:02] well other than that, the new version has support for url shortening via proxy
[12:50:12] https://phabricator.wikimedia.org/T233336
[12:51:52] I am going to create the new release and deploy it on the staging instance
[12:52:23] That ticket's initial text is A-grade Ernest Hemingway
[12:54:45] straight to the point
[12:58:27] (PS1) Elukey: Upgrade to upstream version 1.27.0 [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/634500 (https://phabricator.wikimedia.org/T233336)
[12:58:45] klausman: basically this is what we do --^
[12:59:00] docker with a buster image, rm -rf node_modules && npm install
[12:59:04] that's it
[12:59:12] we basically rely on the version on npm
[13:00:17] Not great, but better than tar xf;configure;make; make install
[13:00:47] well it is a way to freeze the deps without doing it live in production, npm is always a mess
[13:01:59] then what I do is:
[13:02:16] 1) ssh to deploy1001 and go into /srv/deployment/analytics/turnilog/deploy
[13:02:24] checkout a new branch and cherry pick the change
[13:02:34] (that was 2)
[13:02:38] 3) scap deploy --limit an-tool1005.eqiad.wmnet --no-log
[13:02:45] that is our testing instance
[13:03:34] then in theory ssh -L 9091:localhost:9091 an-tool1005.eqiad.wmnet should be testable
[13:03:59] yep!
[13:05:14] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) Version 1.27.0 was released and it contains Dan's pull request, already deployed it on the testing instance: 1) ssh -L 9091:localhost:9091 an-tool1005.eqiad.wmnet 2) browse localhost:9091
[13:19:54] heya teammmm
[13:19:57] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) @Milimetric when you have a moment can you update https://gerrit.wikimedia.org/r/622600 ?
I'll apply it manually on an-tool1005 :)
[13:21:31] holaaa
[13:21:34] * elukey bbiab
[13:23:41] (CR) Mforns: [V: +2] "@Ottomata, I can not see the nit that you mentioned. Maybe your comment was not saved?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[13:37:25] elukey: what about Andrew's comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/622600/2/modules/turnilo/templates/config.yaml.erb
[13:40:32] milimetric: good morning :) we need the proxy, the vlan firewall doesn't allow us to contact api-rw etc..
[13:40:42] ok
[13:42:00] ok elukey, I can test if you're busy but you seemed excited :)
[13:42:32] nono I can
[13:43:19] Oct 16 13:43:12 an-tool1005 turnilo[27949]: Fatal settings load error: Invalid or unexpected token
[13:43:22] Oct 16 13:43:12 an-tool1005 turnilo[27949]: SyntaxError: Invalid or unexpected token
[13:43:25] could it be the # comment?
[13:45:03] Analytics, User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (elukey) Something really AWESOME that I just discovered is https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=ana...
[13:46:29] milimetric: yep the comment is no bueno
[13:47:04] oh, fixing
[13:48:43] so it seems working, I get "Blocked users can't make short URLs."
[13:48:58] and it might make sense since we are tunnelling via ssh
[13:49:11] (the client ip is localhost)
[13:50:31] elukey: if you want you can console.log(context.clientIp) to verify
[13:53:12] milimetric: you mean in my dev tools on the browser or elsewhere?
[13:53:32] elukey: in the turnilo config, like, for example instead of the comment line
[13:53:36] ahhh
[13:53:38] sorry okok
[13:53:55] (don't be sorry, it's unholy to pass js in yaml :))
[13:54:23] yep ::1 is logged
[13:55:22] lemme try with my external ip as xforwarded for
[13:55:32] I'd imagine the shortener doesn't like that, maybe hardcode a real-looking ip instead of context.clientIp in the XFF header
[13:55:36] yeah :)
[13:55:48] I do *not* win the Hemingway award today :(
[13:56:41] https://w.wiki/h7C - victory!
[13:56:59] \o/
[13:57:26] k, well, not 100% but hopefully when this thing's deployed for real it actually passes the clientIp properly
[13:57:35] exactly yes
[13:58:52] mmm milimetric, we could try adding the same console.log on turnilo.w.o and see what ip it logs
[13:59:14] you need the new version
[13:59:22] context is the new parameter I added
[13:59:28] ah okok
[13:59:48] well we can test the new version, upgrade and worst case we'll have to send another pull request
[14:02:55] yep, sounds good to me
[14:03:32] now that I think about it, it might not work
[14:03:36] in theory we have this config
[14:04:09] envoy/nginx (TLS terminator) -> httpd (with cas auth) -> turnilo (nodejs)
[14:04:36] so in the last leg, where IIUC the client ip gets collected, we have probably localhost
[14:05:41] now my question is - can we get the ip from another http header with the new code?
[14:07:43] (basically passing the header through the chain up to nodejs)
[14:13:48] elukey: talk in batcave?
[14:18:06] milimetric: sure! (I was getting coffee sorry)
[15:05:13] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Nuria) >So, if you just say "this number is too low to be displayed" , I don't think that anyone will complain This is ac...
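Editor's note on the proxy chain question above (envoy/nginx -> httpd -> turnilo, where the last leg only sees localhost): a minimal sketch of reading the client IP from a forwarded header instead of the socket peer. It assumes each proxy in the chain appends to `X-Forwarded-For` and that the proxies listen on loopback. The `client_ip` function and its `trusted_proxies` parameter are hypothetical and are not Turnilo's API; the header lookup here is also case-sensitive for simplicity.

```python
import ipaddress
from typing import Iterable, Mapping


def client_ip(
    headers: Mapping[str, str],
    peer_addr: str,
    trusted_proxies: Iterable[str] = ("127.0.0.1", "::1"),
) -> str:
    """Return the best guess for the real client address.

    The header is only believed when the direct peer is a trusted proxy;
    otherwise anyone could spoof it.
    """
    trusted = {ipaddress.ip_address(p) for p in trusted_proxies}
    if ipaddress.ip_address(peer_addr) not in trusted:
        return peer_addr

    hops = [h.strip() for h in headers.get("X-Forwarded-For", "").split(",")]
    # Walk from the closest hop outwards and skip our own proxies.
    for hop in reversed([h for h in hops if h]):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry: stop trusting the rest of the header
        if addr not in trusted:
            return str(addr)
    return peer_addr


if __name__ == "__main__":
    # Request arriving from the local httpd, which forwarded the header.
    print(client_ip({"X-Forwarded-For": "203.0.113.7, 127.0.0.1"}, "::1"))
    # Direct request through an ssh tunnel: only the loopback peer is known.
    print(client_ip({}, "::1"))
```

The second call mirrors the ssh tunnel test in the chat, where `::1` was logged because no upstream proxy had set the header.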
[15:27:03] Analytics: Create monthly job for canonical pageviews - https://phabricator.wikimedia.org/T265732 (mforns)
[15:35:05] Analytics, User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (razzi) @elukey I like that plan to keep both proxies running and switch ATS to 8443.
[16:05:17] razzi: goood morning, let me know when you are ready for some hadoop stuff (test cluster, host decom, etc..)
[16:05:47] if you want to skip for today we can also do it on tuesday (monday is packed with meetings)
[16:05:53] elukey: give me 5, then cya in bc? Today's good
[16:06:33] sure
[16:06:42] good timing for a coffee :)
[17:25:18] ok people I am logging off, have a good weekend!
[17:45:31] Analytics, User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (razzi)
[18:35:48] Analytics-Radar, Product-Analytics, Structured Data Engineering, Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (nettrom_WMF) @egardner : Thanks for the updates and work so far. Thanks also for yo...
[18:44:13] Analytics-Radar, Product-Analytics, Structured Data Engineering, Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (nettrom_WMF) Also, I think storing previous and current state of the filters is a g...
[19:05:23] Analytics, Analytics-Kanban: Fix the remaining bugs open on for Hue next - https://phabricator.wikimedia.org/T264896 (Milimetric) p:Triage→High
[20:22:09] Analytics-Radar, Analytics-Wikimetrics, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Archive analytics-wikimetrics (deprecated by Event Metrics) - https://phabricator.wikimedia.org/T219334 (MarcoAurelio) a:MarcoAurelio→None
[20:47:55] Analytics, Analytics-Kanban: Fix the remaining bugs open on for Hue next - https://phabricator.wikimedia.org/T264896 (Milimetric) I fixed the problem [[ https://github.com/milimetric/hue/blob/master/apps/oozie/src/oozie/templates/dashboard/list_oozie_workflow_graph.mako#L85 | with this ]]. I applied it...
[21:00:37] Analytics, Analytics-Kanban, Privacy Engineering, Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (Milimetric) @nettrom_WMF, yeah, we'll have to wait to re-sanitize, but that'll happen automatically in early...
[21:33:16] Analytics: Quick data exploration CLI - https://phabricator.wikimedia.org/T265765 (Milimetric)
[21:45:13] (PS13) Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254)
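Editor's note on the morning discussion about the Hue graph failing on the `"${cassandra_keyspace}"` value: a minimal sketch of why literal double quotes in a value can break a translation to JSON. It assumes the translation is done by plain string concatenation, which is a guess for illustration and not necessarily what Hue does or what the fix linked in T264896 changes. Both helper functions are hypothetical.

```python
import json


def naive_to_json(name: str, value: str) -> str:
    """Build JSON by string concatenation (breaks on embedded quotes)."""
    return '{"name": "' + name + '", "value": "' + value + '"}'


def safe_to_json(name: str, value: str) -> str:
    """Let the JSON encoder escape embedded quotes."""
    return json.dumps({"name": name, "value": value})


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    name = "cassandra.output.keyspace"
    value = '"${cassandra_keyspace}"'  # quotes are part of the value, as CQL needs
    print(is_valid_json(naive_to_json(name, value)))
    print(is_valid_json(safe_to_json(name, value)))
    print(json.loads(safe_to_json(name, value))["value"] == value)
```

With an encoder doing the escaping, the workflow can keep its explicitly double-quoted value and no switch to single quotes is needed.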