[01:57:53] Quarry: In the Quary tool the download as excel creates empty files - https://phabricator.wikimedia.org/T261909 (Jarekt)
[05:47:35] Quarry: In the Quary tool the download as excel creates empty files - https://phabricator.wikimedia.org/T261909 (Aklapper) Similar case recently was {T257453}
[06:40:46] good morning
[06:49:01] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) @Milimetric +1 on the patch, it seems a good use case that others can benefit. Upgrading turnilo to a new version is very easy once they commit + release the new one :)
[07:14:04] Analytics, Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (elukey) Reporting the SAL entries that we mistakenly logged to another task: Mentioned in SAL (#wikimedia-operations) [2020-09-02T14:28:58Z] execute kafka topics --...
[07:14:25] Analytics, Analytics-Kanban, Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (elukey) The above SAL entries are of course wrong :)
[07:34:06] brb
[07:49:51] Analytics-Radar, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ayounsi) a:ayounsi→Cmjohnson >>! In T259071#6430943, @Cmjohnson wrote: > @ayounsi can you add the analytics vlan to cloud...
[08:00:40] Hi team
[08:02:32] bojour
[08:02:35] *bonjour
[08:02:42] Bonjour elukey
[08:17:34] Hue is so frustrating
[08:17:53] they want to keep compatibility with py2 for some absurd reason
[08:18:24] and they didn't test properly paths like ldap, kerberos, etc..
[08:18:41] so the "main" app doesn't return horrors, but if you have specific auth needs bum
[08:19:01] :(
[08:22:19] * elukey coffee
[09:09:14] joal: I am going to make your inner data nerdism happy with two links, ready?
:D
[09:09:31] \o/ !
[09:09:43] https://www.apachecon.com/acah2020/tracks/bigdata-1.html and https://www.apachecon.com/acah2020/tracks/bigdata-2.html
[09:10:10] Holly molly!
[09:10:20] * joal 's prodictivity drops blatantly
[09:10:29] ahhahaha
[09:10:56] "Next Gen Data Lakes using Apache Hudi"
[09:11:05] sounds something interesting :P
[09:11:10] "Hadoop Storage Reloaded: the 5 lessons Ozone learned from HDFS" as well
[09:11:18] "Building efficient and reliable data lakes with Apache Iceberg"
[09:11:42] Apache Hadoop YARN: Past, Now and Future is also nice
[09:11:52] elukey: we should register and attend
[09:12:17] timing wise it is all happening late evening time, doable but not great
[09:12:27] it is apachecon NA so it cannot be better
[09:12:30] Let's ask 'The Boss'™
[09:12:50] lol max cost is a donation of 50 dollars
[09:13:16] I am sending an email to internal
[09:13:24] elukey: "Apache Hadoop YARN fs2cs: Converting Fair Scheduler to Capacity Scheduler"
[09:13:34] "HDFS Migration from 2.7 to 3.3 and enabling Router Based Federation (RBF) in production"
[09:13:43] ok I'm fully in - We should go
[09:14:49] track 2 is also of interest "Build a reliable, efficient and easy-to-use service for Apache Spark at Uber's scale"
[09:15:40] "Secure your Big Data Cloud cluster with SDX (Ranger, Atlas, Knox, HMS)"
[09:16:34] "Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service" <-- elukey, would that ring a bell?
[09:17:42] wow
[09:17:55] I didn't see the HDFS migration!
[09:18:48] elukey: I'm now not only into asking, but more into convincing 'the Boss'™ that we should go (possibly more than us two)
[09:19:55] yes it is definitely good
[09:20:11] I mean, it is basisally work that we are doing or that we'll do soon
[09:20:55] elukey: I'd put it as, it's talks about work we're doing NOW (hudo/iceberg, hdfs migration, shuffle problems)
[09:22:00] I also think that Andrew would like to follow the Flink talk
[09:22:09] Very possible
[09:28:07] ok, I've made my list - there are 16 talks I'd like to attend, with 3 conflict (on Sept 29th)
[09:35:25] Thanks a milion elukey for having enlighten me today with this schedule :)
[09:41:17] joal: while I was reading the titles I was also seeing "subtitle: Joseph"
[09:41:48] <3
[10:38:19] for when you come online ottomata I have some questions for you - I have found many event duplication in event.mediawiki_revision_create
[10:52:27] * elukey lunch!
[11:39:16] taking a break until standup (meeting-heavy evening)
[11:44:56] Analytics-Clusters: hue.wikimedia.org throws an exception when trying to log in with a non-ASCII username - https://phabricator.wikimedia.org/T260929 (elukey) It is the shell username yes!
[11:49:53] Analytics, Operations: Create analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (elukey)
[11:57:53] Analytics, Operations: Create analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (elukey) Looks like it worked :) https://lists.wikimedia.org/mailman/listinfo/analytics-announce
[11:58:54] I just got a call from my landlord: al electrician will show up "sometime later today, around 4 or 5", so I might be without power for a while then. I'll try to attend meetings as I can, but might have to drop out or join late
[12:00:32] ack!
[12:02:14] I should really set up my WMF Google account on my phone
[13:34:39] created https://lists.wikimedia.org/mailman/listinfo/analytics-announce if anybody wants to subscribe :)
[13:34:45] I'll send later on a test msg
[13:41:55] Analytics, Operations, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) a:Ottomata
[13:42:06] * elukey coffee
[13:42:33] subscribed, nice!
[13:56:10] !log rerun edit-hourly-druid-wf-2020-08 after failed attempt
[13:56:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:56:31] !log Kill-restart mediawiki-history-reduced oozie job into production queue
[13:56:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:57:27] joal: I'm working on that!
[13:57:54] milimetric: bad habits - sorry :(
[13:58:06] I reran it and it failed again, but now I don't know if you reran it at the same time
[13:58:08] milimetric: I'll make an effort to not take actions
[13:58:12] meh
[13:58:21] it might have refailed!
[13:58:26] I could log it before, I was gonna log after
[13:58:40] yup - failed 3rd time
[13:58:43] I don't see anything obviously wrong with the data and of course the indexation log is super unhelpful "indexation failed"
[13:58:46] "check engine"
[13:59:17] !log edit-hourly-druid-wf-2020-08 fails consistently
[13:59:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:27] milimetric: it seems there isn't even a map-reduce indexation job :(
[13:59:36] ?
[13:59:43] milimetric: you're gonna need to look at indexation-task jobs on druid servers
[14:00:02] yeah, that's what I was gonna do next, trying to see if there's docs / or if I have to write them
[14:00:25] milimetric: druid indexation is launched by oozie onto a druid machine, whose job is to launch 2 map-reduce jobs for index files to be prepared on HDFS
[14:07:55] Analytics, Analytics-Kanban, good first task: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (paulkernfeld) @Nuria thanks, I will check that one out.
[14:14:15] Quarry: Downloading Excel XLSX files creates file with 0 bytes - https://phabricator.wikimedia.org/T261958 (Jinoytommanjaly)
[14:17:18] * elukey is spamming the team
[14:29:10] has anybody got my email?
[14:29:17] milimetric: the druid admin ui will point you to indexation jobs if they are happening : https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Indexing_Logs
[14:29:27] elukey: yes
[14:29:37] elukey: got it, thanks for setting that up
[14:29:42] nuria: yeah, I'm writing some docs as I go
[14:34:56] nuria: goood thanks!
[14:54:15] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Product-Infrastructure-Data: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[14:57:08] Quarry: Downloading Excel XLSX files creates file with 0 bytes - https://phabricator.wikimedia.org/T261958 (Mainframe98)
[14:57:21] Quarry: In the Quary tool the download as excel creates empty files - https://phabricator.wikimedia.org/T261909 (Mainframe98)
[15:01:35] (oof, meeting running late, brt
[15:01:36] )
[15:01:49] ping ottomata milimetric
[15:01:56] (brt)
[15:01:57] OH my
[15:04:15] Analytics, Operations: Create analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (elukey) Open→Resolved
[15:04:17] Analytics, Analytics-Kanban: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (elukey)
[15:04:37] Analytics, Analytics-Kanban: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (elukey) Created https://lists.wikimedia.org/mailman/listinfo/analytics-announce :)
[15:17:47] Analytics, Operations, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) >>! In T255871#6433346, @Ottomata wrote: > Sure! Although I have to admit I don't know what this means. It already runs env...
[15:23:34] elukey: any idea why http://localhost:8081/unified-console.html isn't loading? druid1008 looks like the leader, so I'm doing ssh -N druid1008.eqiad.wmnet -L 8081:localhost:8081
[15:23:42] (saving you from the boring discussion :))
[15:24:48] milimetric: it loads for me (chrome)
[15:24:54] same ssh tunnel
[15:25:12] * milimetric throws water on keyboard
[15:25:56] it loads now!!!!
[15:26:01] I HATE EVERYTHING
[15:26:54] * joal sends wikilove to milimetric
[15:29:40] Analytics, Operations, Prod-Kubernetes, serviceops, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) OH! yes...there was a reason we left HTTP on...I think it was before MW was using a local envoyproxy to do TLS, because PHP...
[16:10:41] elukey: druid interna;l cluster is druid1004(5and 6) ?
[16:11:11] nuria: do you mean the analytics cluster?
[16:11:22] or the one behind aqs?
[16:12:13] if the latter druid1004-8
[16:13:46] elukey: about AQS snapshot bump, I assume we'll do that after meeting?
[16:13:59] or tomorrow elukey?
[16:14:30] joal: I'd say later on so razzi can participate
[16:14:39] works for me elukey
[16:14:43] Analytics-Clusters, Discovery, Discovery-Search (Current work), Patch-For-Review: Move mjolnir kafka daemon from ES to search-loader VMs - https://phabricator.wikimedia.org/T258245 (Nuria) Any updates on this from the mjonlir work?
[16:14:57] elukey: teh analytics one
[16:15:00] *the
[16:16:06] nuria: so an-druid100[1,2] and druid100[1-3]
[16:17:47] elukey: i see, sorry, i shoudl have looked at that in grfana
[16:19:17] nuria: np! happy to help, naming is confusing atm.. but eventually, an-druid* will be analytics and druid* public
[16:22:05] joal: will explain to razzi later on the process, aqs1004 depooled and ready :)
[16:22:21] elukey: ok - I was happy to wait, later meetings tonight
[16:22:30] Will test nonetheless elukey
[16:23:26] works for me elukey - we can pursue deploy :)
[16:24:27] ack starting
[16:26:06] joal: done!
[16:26:11] checking UI elukey
[16:26:39] milimetric: we updated druid since our last indexation of edits_hourly so it can very well be we are missing a java dep, let me ssh and try to find indexation logs
[16:28:04] elukey: UI looks good - done on my side :)
[16:28:08] thanks a lot elukey
[16:28:21] super :)
[16:34:19] nuria: ahhh!
[16:34:19] java.lang.IllegalArgumentException: Cannot construct instance of `java.lang.Class`, problem: io.druid.data.input.parquet.DruidParquetInputFormat
[16:34:47] elukey: feels like a deja-vu :)
[16:35:04] yes I know the issue, io.druid. vs org.apache.druid
[16:35:19] yes past Luca didn't fix it
[16:35:20] sigh
[16:36:18] elukey: please don't beat yourself up :)
[16:36:38] * joal sends wikilove and apachelove to elukey
[16:36:43] (PS1) Elukey: Fix parquet druid configuration for the edits_hourly coordinator [analytics/refinery] - https://gerrit.wikimedia.org/r/624117
[16:36:45] joal: I fixed it for reduced, not edits hourly
[16:36:46] sigh
[16:37:16] (CR) Joal: [V: +2 C: +2] "LGTM - Merging before deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/624117 (owner: Elukey)
[16:38:32] milimetric: --^
[16:38:42] elukey: as refinery needs to be deployed, shall we merge my patch for data-purge?
[16:39:44] elukey: I also would like to discuss with you a postential evolution of our data-purging system, allowing to have a single place where all purge is stored (yaml conf file)
[16:40:21] joal: ah do you mean https://gerrit.wikimedia.org/r/c/analytics/refinery/+/623586 ?
[16:40:43] correct elukey - see my comment (just before andrew last)
[16:40:47] joal: mm what do you mean with data purging system? systemd timers?
[16:41:13] elukey: the fact that some purging is done with 1 timer per purge, other with 1 timer for multiple datasets
[16:41:22] I'd rather have everything done the same way
[16:42:58] joal: as you prefer, it is a matter of having a script that is capable of doing that
[16:43:11] I am supportive but probably we don't have time now (I think)
[16:43:14] Yes elukey
[16:45:41] I'll 'make a task'™
[16:45:52] there is always a task
[16:45:55] :D
[16:46:19] (jokes aside, seems a nice "Tech debt" task to create, to be tackled during one of the next quarters)
[16:46:30] yup
[16:50:01] milimetric: updated log location to /srv/druid/indexing-logs
[16:50:32] thx!
[16:52:23] thanks so much yall, looks like I just have to deploy and restart
[16:52:52] elukey: ok for the datapurge task to be merged?
[16:53:14] milimetric:
[16:53:15] "unparseableEvents":null,"rowStats":{},"errorMsg":"java.lang.IllegalArgumentException: Cannot construct instance of `java.lang.Class`, problem: io.druid.data.input.parquet.DruidParquetInputFormat\
[16:53:57] joal: yes yes I was about to say +1
[16:54:00] nuria: yeah, I think that’s what Luca pasted above, it’s fixed in a patch Jo merged, to refinery
[16:54:06] (just finished the review)
[16:54:12] ack elukey - sorry for the rush :S
[16:54:20] joal: you are not excused! :P
[16:54:27] :D :D :D
[16:54:33] * joal looks down in shame :)
[16:54:38] ah sorry, i missed the backscroll
[16:55:05] nono, thanks for finding!
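(Editor's sketch, not part of the log.) The error pasted twice above comes down to the package rename mentioned at 16:35:04, `io.druid.` vs `org.apache.druid.`. The actual fix is the refinery patch at https://gerrit.wikimedia.org/r/624117, whose contents are not shown here, so the snippet below is only an illustration of the idea: walk a JSON-like spec and rewrite any class name that still uses the old prefix. The `spec` layout and the helper name `rename_druid_classes` are hypothetical; only the class name is taken from the log.

```python
import json

OLD_PREFIX = "io.druid."
NEW_PREFIX = "org.apache.druid."


def rename_druid_classes(node):
    """Return a copy of a JSON-like structure with old-style class names rewritten."""
    if isinstance(node, dict):
        return {key: rename_druid_classes(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_druid_classes(item) for item in node]
    if isinstance(node, str) and node.startswith(OLD_PREFIX):
        return NEW_PREFIX + node[len(OLD_PREFIX):]
    return node


# Hypothetical spec fragment; the real template in refinery may look different.
spec = {
    "ioConfig": {
        "inputSpec": {
            "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        }
    }
}

fixed = rename_druid_classes(spec)
print(json.dumps(fixed, indent=2))
```

Only string values that start with the old prefix are touched, so keys and unrelated strings pass through unchanged.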
[16:55:16] elukey: i remember that
[16:55:20] elukey: k easy FIX
[16:55:57] yeah I forgot that one
[16:56:50] milimetric: I updated the ehterpad doc with the refinery deploy (2 things - edit-hourly-druid and data-purge)
[17:02:32] Analytics, Analytics-Kanban, Product-Analytics: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (Nuria)
[17:03:09] Analytics, Analytics-Kanban, Product-Analytics: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (Nuria) We need to update the protocol for data access so people (or SRE?) subscribe users with analytics-private data permits to this e-mail list
[17:04:23] thx much, jo
[17:05:07] (CR) Joal: [V: +2 C: +2] "Merging for today's deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/623586 (https://phabricator.wikimedia.org/T237047) (owner: Joal)
[17:08:41] going afk !
[17:08:42] o/
[17:08:50] bye elukey
[17:14:26] Quarry: In the Quary tool the download as excel creates empty files - https://phabricator.wikimedia.org/T261909 (Framawiki) Indeed, Due to {T238375}, there were 10G of tmp files in /tmp, making `/` (/dev/vda3) full, again. 775 files. Thanks for the report, it should be good now.
[17:14:43] Quarry: In the Quary tool the download as excel creates empty files - https://phabricator.wikimedia.org/T261909 (Framawiki) Open→Resolved a:Framawiki
[17:17:50] Analytics-Clusters, Analytics-Radar, Operations, observability, Patch-For-Review: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (herron)
[17:19:07] Analytics-Clusters, Analytics-Radar, Operations, observability, Patch-For-Review: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (herron) The buster kafkamon hosts are now live. Will let them settle for a bit and then move on to cleanup/teardown of the old h...
[18:32:47] Starting build #58 for job analytics-refinery-maven-release-docker
[18:38:23] for anyone in the channel interested, work in progress: https://meta.wikimedia.org/wiki/Research:Emerging_Technical_Communities
[18:42:13] Project analytics-refinery-maven-release-docker build #58: SUCCESS in 9 min 25 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/58/
[18:49:29] Starting build #25 for job analytics-refinery-update-jars-docker
[18:49:46] Project analytics-refinery-update-jars-docker build #25: SUCCESS in 28 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/25/
[18:49:50] (PS1) Maven-release-user: Add refinery-source jars for v0.0.135 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/624219
[18:53:22] (CR) Milimetric: [V: +2 C: +2] Add refinery-source jars for v0.0.135 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/624219 (owner: Maven-release-user)
[19:15:25] !log finished deploying refinery and refinery-source, restarting jobs now
[19:15:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:15:44] ottomata: you might wanna do https://gerrit.wikimedia.org/r/c/operations/puppet/+/623601 now that refinery's synced to hdfds
[19:15:47] *hdfs
[19:23:01] milimetric hm ok should I just do that? you will watch and make sure is ok?
[19:23:19] maybe we should do that monday?
[19:23:22] ottomata: runs every 15th and 30th of the month
[19:23:31] ohk
[19:23:35] i guess i just merge then?
[19:24:07] ottomata: feels ok for me - I wonder of months not having 30 days
[19:24:45] ottomata: yeah, we'll watch it on the 15th I guess
[19:25:40] merged
[19:25:57] Thanks
[19:29:18] gone for tonight team o/\
[19:32:47] (PS1) Milimetric: Update aqs to 4be582b [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/624233
[19:52:48] (CR) Milimetric: [V: +2 C: +2] Update aqs to 4be582b [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/624233 (owner: Milimetric)
[19:58:04] PROBLEM - Check the last execution of mediawiki-history-drop-snapshot on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:17:34] q: does anyone know where i can check for the memory usage threshold for the stat machines where a process gets killed? is it hard-coded for each or a percentage of the machine's RAM?
[20:19:59] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (mpopov)
[20:31:56] nevermind, i think i found the config here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/analytics/client/limits/user-resource-control.conf.erb
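
(Editor's sketch, not part of the log.) On the 19:24:07 question about months without 30 days: for a job described as running "every 15th and 30th of the month", February is the only month that lacks a 30th. The snippet below lists the run dates for a year under the assumption that a run is simply skipped when the date does not exist; how the real timer is configured is not shown in the log, and `run_dates` is a hypothetical helper.

```python
import calendar
from datetime import date


def run_dates(year, days=(15, 30)):
    """List the dates in `year` on which a 15th-and-30th schedule would fire."""
    runs = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        for day in days:
            if day <= last_day:  # assumption: a non-existent date is skipped
                runs.append(date(year, month, day))
    return runs


runs_2021 = run_dates(2021)
months_with_one_run = [
    month for month in range(1, 13)
    if sum(1 for d in runs_2021 if d.month == month) == 1
]
print(len(runs_2021), months_with_one_run)
```

Under that assumption the job runs 23 times a year instead of 24, with a single February run on the 15th, in leap years as well.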