[08:26:53] joal: o/
[08:27:27] I was able to get Snakebite working with SASL + encryption with the Client Namenode protocol
[08:27:39] the code is still horrible but works on an-tool1006
[08:28:13] now I have a clearer idea about how it all works, the documentation is really not great
[08:29:28] then there is the datanode protocol, a completely different beast :D
[09:09:31] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Create test Kerberos identities/accounts for some selected users in hadoop test cluster - https://phabricator.wikimedia.org/T212258 (10elukey) The experiment can be called done, one identity was tested and everything looked fine. We are considering enabling Ker...
[09:09:37] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Create test Kerberos identities/accounts for some selected users in hadoop test cluster - https://phabricator.wikimedia.org/T212258 (10elukey)
[09:54:50] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201063 (10mobrovac)
[09:54:53] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10mobrovac) 05Open→03Resolved Indeed we can!
[10:23:34] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Prepare the Hadoop Analytics cluster for Kerberos - https://phabricator.wikimedia.org/T237269 (10elukey)
[10:59:46] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10mobrovac) That makes sense, I guess, since, as you point out, the main point of contention is write access (or the...
[11:26:08] * elukey lunch
[12:49:53] Awesome work elukey :)
[13:02:08] fdans, question for you - it seems today's backfilling job started at 8am UTC - did anything unexpected happen?
[13:06:54] joal: yes, I'm hoping tomorrow it goes ok since I moved the command directly into the crontab instead of running a bash script
[13:16:02] ok fdans :)
[14:25:13] 10Analytics, 10Analytics-Kanban, 10Inuka-Team: Update ua parser on analytics stack - https://phabricator.wikimedia.org/T237743 (10SBisson)
[14:41:22] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) Ok! Great. I don't love the names system and user, as that isn't quite right. The schemas in the ana...
[14:41:39] elukey: o/
[14:41:44] i have a client side error logging meeting during our ops sync
[14:41:56] want to do ours before that? any time in the next 1h 20 mins?
[14:42:07] ottomata: o/
[14:42:10] now?
[14:42:13] sure!
[14:54:50] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10mobrovac) Lol, I know you're always interested in discussions involving bikes of any type :P On a more serious no...
[15:09:25] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) Ya, indeed. I think 'production' is an ok name, but you are right in that 'analytics' might not be very...
[15:24:38] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decommissioning - https://phabricator.wikimedia.org/T231858 (10Ottomata) Mysqldumping both hosts now: sudo mysqldump --all-databases --skip-lock-tables --quick > mysqldump-$(hostname)-$(...
[15:41:24] going afk for a bit!
[15:50:14] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[15:54:10] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Jclark-ctr) 05Open→03Resolved alert cleared, no errors in icinga
[16:08:55] forced remount on notebook1004
[16:09:36] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10fgiunchedi)
[16:09:39] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Create client side error schema - https://phabricator.wikimedia.org/T229442 (10fgiunchedi) 05Open→03Resolved Resolving, all done!
[16:20:19] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[16:23:57] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) These are projected to be in eqiad Dec 5th
[16:36:18] 10Analytics: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10Nuria) @Danielsberger clarifying a bit: - the upload dataset that will be provided will not have any "save" flags, @lexnasser is finalizing that one - we can work on a different data...
[16:36:40] mforns: do you want to talk about alarms if you are around?
[16:37:20] nuria, sure!
[16:37:22] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[16:37:24] batcave?
[16:37:27] mforns: ya
[16:37:40] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[16:37:42] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Vertical: Virtualpageview datastream on MEP - https://phabricator.wikimedia.org/T238138 (10Ottomata)
[16:40:09] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata)
[16:40:26] a-team i will miss standup today; better use of data stakeholder meeting conflicts
[16:43:59] ack!
[17:00:20] ping joal, ottomata
[17:09:01] (03CR) 10Nuria: [C: 03+2] Add query to track WDQS updater hitting Special:EntityData (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:10:32] nuria: sorry, am in the better use of data stakeholder meeting
[17:10:40] ottomata: k
[17:10:47] ottomata: can you send slides?
[17:12:56] (03Merged) 10jenkins-bot: Add query to track WDQS updater hitting Special:EntityData [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:13:27] nuria i think i was able to share with you via google drive
[17:18:48] (03CR) 10Ladsgroup: "Thanks. When is it going to get deployed? I have no idea how this refinery works." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:45:37] (03CR) 10Nuria: [C: 03+2] "We normally deploy once a week, on Wednesdays if there are a couple of changes; this week this is the only one, so we will deploy it" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:49:18] (03CR) 10Ladsgroup: "Thanks for the note." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:50:07] 10Analytics, 10WMDE-Analytics-Engineering, 10Wikidata, 10Patch-For-Review, and 2 others: Track WDQS updater UA in wikidata-special-entitydata grafana dashboard - https://phabricator.wikimedia.org/T218998 (10Ladsgroup) It needs to be deployed, that will probably happen next Wednesday, and then I need to add i...
[18:26:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey) @Bstorm summary of the next steps, so you can review and approve or not: * Th...
[18:38:29] elukey: wow didn't realize how big these dbs are
[18:38:45] i'm not sure there is enough space on the db hosts themselves to dump
[18:39:07] and i don't see enough on stat1007 either
[18:39:12] maybe enough on stat1006 to do one at a time
[18:39:24] i could just do a binary backup of the db files
[18:39:29] do we need it as a mysql dump?
[18:41:31] mforns: yt? i'm going to send an email about cleaning up unused data in home dirs on stat1007
[18:41:36] you are using 449G there, anything you can delete :)
[18:41:48] fdans: you too? only 37G though
[18:43:57] ottomata: is there a risk of not being able to recover from the binary backup if needed in the future? Say newer versions of mariadb etc..
[18:44:08] yeah might be
[18:44:18] this would be my only fear
[18:44:30] wow ezachte's home is 687G on stat1007
[18:44:31] even mysqldump directly compressed takes too much space?
[18:44:33] going to make a task about that
[18:44:45] yeah it holds a huge amount of things
[18:44:51] elukey: i'm not compressing as it goes; i don't think that would work? i'd think it'd need the whole thing out before compressing?
[18:46:37] 10Analytics: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (10Ottomata)
[18:46:41] ottomata: what I meant is piping it to gzip
[18:46:44] yes
[18:46:51] but wouldn't gzip need the whole thing in memory to compress it?
[18:46:57] or can it stream to disk while compressing?
[18:47:19] I think it needs the whole stdout of mysqldump
[18:47:24] ya
[18:47:31] which won't fit in memory, dunno how it would do that
[18:48:53] hm maybe it does?
[18:49:00] gonna stop my mysqldumps and try
[18:49:48] so in theory the kernel should buffer until gzip reads from the pipe
[18:50:01] possibly even paging to disk?
[18:50:11] how big is the db more or less?
[18:50:25] 2TB
[18:50:45] dunno how big the dump will be, assuming bigger (uncompressed)
[18:51:49] ok trying with | gzip
[18:53:14] ahhahaha 2TB??
[18:53:20] what the hell
[18:53:45] we may think about bacula?
[18:56:14] yeah?
[18:56:32] would that help?
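On the question above of whether gzip needs mysqldump's whole stdout in memory: it doesn't. A pipe moves data through a small fixed-size kernel buffer, so gzip compresses incrementally and memory use stays flat regardless of input size. A minimal demo of the pattern, with the dump command from the log shown as a comment (the gzip stage added there is the suggestion under discussion, not a verified procedure):

```shell
# Data flows through the pipe in small chunks; gzip reads and compresses
# as it goes, so neither side ever holds the full payload in memory.
seq 1 100000 | gzip -c | gunzip -c | tail -n 1
# The same pattern applied to the dump (base command from the log):
# sudo mysqldump --all-databases --skip-lock-tables --quick \
#   | gzip > mysqldump-$(hostname)-$(date +%F).sql.gz
```

This avoids ever materializing the uncompressed dump on disk, which matters when the databases are ~2TB and the hosts lack that much free space.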
[19:02:31] (03PS1) 10Ottomata: Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818)
[19:02:43] (03PS2) 10Ottomata: Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818)
[19:02:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818) (owner: 10Ottomata)
[19:03:24] elukey fyi I just removed refinery deployment from db110[78] hosts
[19:03:26] and files
[19:03:33] saved 17G :)
[19:03:35] maybe we'll need it
[19:03:49] ottomata: well if the size is acceptable then it will just copy stuff over to our backup infra without us needing to do anything
[19:07:42] doesn't bacula just copy stuff from the filesystem?
[19:08:22] it can also use xtrabackup first to generate a dump of the db in some format
[19:08:44] hm, ya but i'd assume that dump is local. does it stream it to bacula?
[19:09:19] it may do that, we'd need to follow up with Jaime
[19:28:08] going to dinner! ttl :)
[19:36:58] heya nuria you are using 102G on stat1007
[19:37:02] do you need all that? :)
[19:49:08] starting to look into a new failure, you all might have an idea though :) An oozie task that runs a python spark script started failing on the 5th; the error is that the spark driver can't seem to talk to the cluster manager
[19:49:25] hm
[19:49:27] is that about when the new spark version was deployed?
[19:49:28] cluster manager?
[19:49:33] yes that's exactly when.
[19:49:44] 11-13 03:52:16,240 [spark-dynamic-executor-allocation] WARN org.apache.spark.ExecutorAllocationManager - Unable to reach the cluster manager to request 1 total executors!
[19:49:51] repeated over and over in each retry
[19:50:01] huh.
[19:50:04] but obviously spark talks to it a little bit, because it's able to start the driver :) Ok i'll poke at it
[19:56:13] 10Analytics, 10Analytics-EventLogging, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[19:57:20] ottomata: thanks for the reminder to free up the space :) I didn't realize how much I was consuming
[19:58:17] :) ty!
[20:02:59] 10Analytics, 10Analytics-Kanban, 10Inuka-Team: Update ua parser on analytics stack - https://phabricator.wikimedia.org/T237743 (10Ottomata) Oh hey, as far as I can tell, this is already done! @JAllemandou updated uap-java with the 0.6.9 version of uap-core on Sept 13th, and refinery-source was deployed with...
[20:21:12] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) production schemas & instrumentation schemas? ;
[20:32:55] ottomata: 102G?
[20:33:00] ottomata: wait
[20:33:20] (03CR) 10Mforns: "LGTM overall! Left 1 controversial comment :]" (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[20:33:34] ottomata: I am not running naything
[20:33:37] *anything
[20:34:05] ottomata: ah in space!
[20:39:00] ottomata: 27G now
[20:39:09] :)
[21:12:37] ottomata: what is the tool you use to do system design diagrams?
[21:17:50] nuria: i used lucidcharts
[21:18:09] https://www.lucidchart.com/
[21:18:19] it is free up to so many elements in the diagram
[21:18:29] dunno if it is good but it worked for me
[22:15:48] this is so odd ... i had an exception that looked like mismatched spark versions, so i swapped the oozie sharelib for spark2.4.4; that errors out with no SparkMain found. I had explicitly set SPARK_HOME=/usr/lib/spark2 to stay with spark2.3.1; removing SPARK_HOME, it now starts but can't find tables that exist in hive. fun :)
[22:16:27] ever
[22:16:28] spark-2.4.4
[22:16:28] ?
[22:16:31] for oozie sharelib?
[22:16:39] oozie admin -shareliblist
[22:16:58] ebernhardson: ^
[22:17:14] ottomata: mhm, i don't have the - in my workflow.xmls, i have oozie.action.sharelib.for.spark and spark2.3.1 with no dash, checking
[22:17:22] yeah, we changed it
[22:17:42] a long time ago we had both spark 1 and spark 2 installed
[22:17:43] ok lemme try spark-2.4.4, might work
[22:17:52] and so all the stuff was spark2-blabla
[22:17:54] might explain why no SparkMain at least :)
[22:17:57] to be consistent with that we made the sharelib be spark2.3.1
[22:18:02] which was a bad idea
[22:18:07] now we no longer have spark 1
[22:18:12] we just use the version as it should be
[22:20:33] 10Analytics, 10Desktop Improvements, 10Event-Platform, 10Readers-Web-Backlog (Kanbanana-2019-20-Q2): [SPIKE 8hrs] How will the changes to eventlogging affect desktop improvements - https://phabricator.wikimedia.org/T233824 (10Jdrewniak) 05Open→03Resolved Looks like this task has been thoroughly analyze...
[22:20:36] ottomata: might have done the trick, i see it picking up executors now
[22:21:25] nope :P
[22:32:21] ottomata: the error was the same in 2.3.1 and 2.4.4: failed finding a table in hive. Adding `--conf spark.sql.catalogImplementation=hive` to the spark invocation has it working now with 2.4.4 (didn't try 2.3.1)
[22:33:13] ottomata: that value is already set on stat1007 in /etc/spark2/conf/spark-defaults.conf, and i would imagine on the rest of the cluster, so not sure why i had to set that..
[22:40:31] (03CR) 10Nuria: Add hdfs-rsync script based on Hdfs python lib (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[22:54:35] hmm, that is strange. hm.
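To make the sharelib renaming discussed above concrete: the server-side sharelib id moved from the dash-less "spark2.3.1" convention to the plain version name, so workflows must reference it accordingly. A sketch of the relevant check and property (the `spark-2.4.4` value is an assumption taken from this conversation, not a verified cluster config):

```
# List the sharelibs the Oozie server actually offers (command from the log);
# the spark entry should appear with the new dashed name:
oozie admin -shareliblist

# job.properties fragment (illustrative) referencing it by that name:
oozie.action.sharelib.for.spark=spark-2.4.4
```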
[22:55:34] hm, so yeah the problem looks to be that spark-defaults.conf isn't being read, or at least that's part of the problem. It's not sourcing the dynamicAllocation conf either (looking in the ui).
[22:55:52] trying to put SPARK_HOME=/usr/lib/spark2 back in; that would get it to read /usr/lib/spark2/conf/spark-defaults.conf
[22:55:53] (03CR) 10Ottomata: Add hdfs-rsync script based on Hdfs python lib (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[22:56:24] ebernhardson: strange though, i think our oozie stuff is working fine...
[22:56:24] hm
[22:56:30] i gotta run, let me know if i can help more tomorrow!
[22:56:33] sure
[23:08:49] PROBLEM - Check the last execution of hdfs-cleaner on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit hdfs-cleaner https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
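Until the spark-defaults.conf sourcing issue above is understood, ebernhardson's workaround amounts to pinning the unsourced defaults explicitly at submit time. A sketch of that invocation (the script name and the dynamicAllocation flag are illustrative assumptions; only the catalogImplementation conf is confirmed by the log):

```
# Pass the settings that spark-defaults.conf would normally supply
# directly on the command line, so the job does not depend on the
# defaults file being read in the Oozie launcher environment.
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.dynamicAllocation.enabled=true \
  my_spark_job.py   # hypothetical script name
```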