[08:26:53] joal: o/
[08:27:27] I was able to get Snakebite working with SASL + encryption with the Client Namenode protocol
[08:27:39] the code is still horrible but works on an-tool1006
[08:28:13] now I have a clearer idea about how it all works, the documentation is really not great
[08:29:28] then there is the datanode protocol, a completely different beast :D
[09:09:31] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Create test Kerberos identities/accounts for some selected users in hadoop test cluster - https://phabricator.wikimedia.org/T212258 (10elukey) The experiment can be called done, one identity was tested and everything looked fine. We are considering enabling Ker...
[09:09:37] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Create test Kerberos identities/accounts for some selected users in hadoop test cluster - https://phabricator.wikimedia.org/T212258 (10elukey)
[09:54:50] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201063 (10mobrovac)
[09:54:53] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Core Platform Team Legacy (Watching / External), and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10mobrovac) 05Open→03Resolved Indeed we can!
[10:23:34] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Prepare the Hadoop Analytics cluster for Kerberos - https://phabricator.wikimedia.org/T237269 (10elukey)
[10:59:46] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10mobrovac) That makes sense, I guess, since, as you point out, the main point of contention is write access (or the...
[11:26:08] * elukey lunch
[12:49:53] Awesome work elukey :)
[13:02:08] fdans, question for you - it seems today's backfilling job started at 8am UTC - did anything unexpected happen?
[13:06:54] joal: yes, I'm hoping tomorrow it goes ok since I moved the command directly into the crontab instead of running a bash script
[13:16:02] ok fdans :)
[14:25:13] 10Analytics, 10Analytics-Kanban, 10Inuka-Team: Update ua parser on analytics stack - https://phabricator.wikimedia.org/T237743 (10SBisson)
[14:41:22] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) Ok! Great. I don't love the names system and user, as that isn't quite right. The schemas in the ana...
[14:41:39] elukey: o/
[14:41:44] i have a client side error logging meeting during our ops sync
[14:41:56] want to do ours before that? any time in the next 1h 20 mins?
[14:42:07] ottomata: o/
[14:42:10] now?
[14:42:13] sure!
[14:54:50] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10mobrovac) Lol, I know you're always interested in discussions involving bikes of any type :P On a more serious no...
[15:09:25] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) Ya, indeed. I think 'production' is an ok name, but you are right in that 'analytics' might not be very...
[15:24:38] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decommissioning - https://phabricator.wikimedia.org/T231858 (10Ottomata) Mysqldumping both hosts now: sudo mysqldump --all-databases --skip-lock-tables --quick > mysqldump-$(hostname)-$(...
[15:41:24] going afk for a bit!
[15:50:14] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[15:54:10] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Jclark-ctr) 05Open→03Resolved alert cleared, no errors in icinga
[16:08:55] forced remount on notebook1004
[16:09:36] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10fgiunchedi)
[16:09:39] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Create client side error schema - https://phabricator.wikimedia.org/T229442 (10fgiunchedi) 05Open→03Resolved Resolving, all done!
[16:20:19] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[16:23:57] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) These are projected to be in eqiad Dec 5th
[16:36:18] 10Analytics: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10Nuria) @Danielsberger clarifying a bit: - the upload dataset that will be provided will not have any "save" flags, @lexnasser is finalizing that one - we can work on a different data...
[16:36:40] mforns: do you want to talk about alarms if you are around?
[16:37:20] nuria, sure!
[16:37:22] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[16:37:24] batcave?
[16:37:27] mforns: ya
[16:37:40] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[16:37:42] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Vertical: Virtualpageview datastream on MEP - https://phabricator.wikimedia.org/T238138 (10Ottomata)
[16:40:09] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata)
[16:40:26] a-team i will miss standup today; better use of data stakeholder meeting conflicts
[16:43:59] ack!
[17:00:20] ping joal, ottomata
[17:09:01] (03CR) 10Nuria: [C: 03+2] Add query to track WDQS updater hitting Special:EntityData (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:10:32] nuria: sorry, am in the better use of data stakeholder meeting
[17:10:40] ottomata: k
[17:10:47] ottomata: can you send slides?
[17:12:56] (03Merged) 10jenkins-bot: Add query to track WDQS updater hitting Special:EntityData [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:13:27] nuria i think i was able to share with you via google drive
[17:18:48] (03CR) 10Ladsgroup: "Thanks. When is it going to get deployed? I have no idea how this refinery works." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:45:37] (03CR) 10Nuria: [C: 03+2] "We normally deploy once a week, on Wednesdays if there are a couple of changes; this week this is the only one, so we will deploy it" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:49:18] (03CR) 10Ladsgroup: "Thanks for the note." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549859 (https://phabricator.wikimedia.org/T218998) (owner: 10Ladsgroup)
[17:50:07] 10Analytics, 10WMDE-Analytics-Engineering, 10Wikidata, 10Patch-For-Review, and 2 others: Track WDQS updater UA in wikidata-special-entitydata grafana dashboard - https://phabricator.wikimedia.org/T218998 (10Ladsgroup) It needs to be deployed, that will probably happen next Wednesday, and then I need to add i...
[18:26:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey) @Bstorm summary of the next steps, so you can review and approve or not: * Th...
[18:38:29] elukey: wow didn't realize how big these dbs are
[18:38:45] i'm not sure there is enough space on the db hosts themselves to dump
[18:39:07] and i don't see enough on stat1007 either
[18:39:12] maybe enough on stat1006 to do one at a time
[18:39:24] i could just do a binary backup of the db files
[18:39:29] do we need it as a mysql dump?
[18:41:31] mforns: yt? i'm going to send an email about cleaning up unused data in home dirs on stat1007
[18:41:36] you are using 449G there, anything you can delete :)
[18:41:48] fdans: you too? only 37G though
[18:43:57] ottomata: is there a risk of not being able to recover from the binary backup if needed in the future? Say newer versions of mariadb etc..
[18:44:08] yeah might be
[18:44:18] this would be my only fear
[18:44:30] wow ezachte's home is 687G on stat1007
[18:44:31] even mysqldump directly compressed takes too much space?
[18:44:33] going to make a task about that
[18:44:45] yeah it holds a huge amount of things
[18:44:51] elukey: i'm not compressing as it goes; i don't think that would work? i'd think it'd need the whole thing out before compressing?
[18:46:37] 10Analytics: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (10Ottomata)
[18:46:41] ottomata: what I meant is piping it to gzip
[18:46:44] yes
[18:46:51] but wouldn't gzip need the whole thing in memory to compress it?
[18:46:57] or can it stream to disk while compressing?
[18:47:19] I think it needs the whole stdout of mysqldump
[18:47:24] ya
[18:47:31] which won't fit in memory, dunno how it would do that
[18:48:53] hm maybe it does?
[18:49:00] gonna stop my mysqldumps and try
[18:49:48] so in theory the kernel should buffer until gzip reads from the pipe
[18:50:01] possibly even paging to disk?
[18:50:11] how big is the db more or less?
[18:50:25] 2TB
[18:50:45] dunno how big the dump will be, assuming bigger (uncompressed)
[18:51:49] ok trying with | gzip
[18:53:14] ahhahaha 2TB??
[18:53:20] what the hell
[18:53:45] we may think about bacula?
[18:56:14] yeah?
[18:56:32] would that help?
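On the question above of whether gzip needs mysqldump's whole stdout in memory: it doesn't. A pipe moves data through a small fixed-size kernel buffer, so gzip compresses incrementally and memory use stays flat regardless of input size. A minimal demo of the pattern, with the dump command from the log shown as a comment (the gzip stage added there is the suggestion under discussion, not a verified procedure):

```shell
# Data flows through the pipe in small chunks; gzip reads and compresses
# as it goes, so neither side ever holds the full payload in memory.
seq 1 100000 | gzip -c | gunzip -c | tail -n 1
# The same pattern applied to the dump (base command from the log):
# sudo mysqldump --all-databases --skip-lock-tables --quick \
#   | gzip > mysqldump-$(hostname)-$(date +%F).sql.gz
```

This avoids ever materializing the uncompressed dump on disk, which matters when the databases are ~2TB and the hosts lack that much free space.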
[19:02:31] (03PS1) 10Ottomata: Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818)
[19:02:43] (03PS2) 10Ottomata: Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818)
[19:02:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove db1107 and db1108 from scap targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/550738 (https://phabricator.wikimedia.org/T236818) (owner: 10Ottomata)
[19:03:24] elukey fyi I just removed refinery deployment from db110[78] hosts
[19:03:26] and files
[19:03:33] saved 17G :)
[19:03:35] maybe we'll need it
[19:03:49] ottomata: well if the size is acceptable then it will just copy stuff over to our backup infra without us needing to do anything
[19:07:42] doesn't bacula just copy stuff from the filesystem?
[19:08:22] it can also use xtrabackup first to generate a dump of the db in some format
[19:08:44] hm, ya but i'd assume that dump is local. does it stream it to bacula?
[19:09:19] it may do that, we'd need to follow up with Jaime
[19:28:08] going to dinner! ttl :)
[19:36:58] heya nuria you are using 102G on stat1007
[19:37:02] do you need all that? :)
[19:49:08] starting to look into a new failure, you all might have an idea though :) An oozie task that runs a python spark script started failing on the 5th; the error is that the spark driver can't seem to talk to the cluster manager
[19:49:25] hm
[19:49:27] is that about when the new spark version was deployed?
[19:49:28] cluster manager?
[19:49:33] yes that's exactly when.
[19:49:44] 11-13 03:52:16,240 [spark-dynamic-executor-allocation] WARN org.apache.spark.ExecutorAllocationManager - Unable to reach the cluster manager to request 1 total executors!
[19:49:51] repeated over and over in each retry
[19:50:01] huh.
[19:50:04] but obviously spark talks to it a little bit, because it's able to start the driver :) Ok i'll poke at it
[19:56:13] 10Analytics, 10Analytics-EventLogging, 10Event-Platform: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[19:57:20] ottomata: thanks for the reminder to free up the space :) I didn't realize how much I was consuming
[19:58:17] :) ty!
[20:02:59] 10Analytics, 10Analytics-Kanban, 10Inuka-Team: Update ua parser on analytics stack - https://phabricator.wikimedia.org/T237743 (10Ottomata) Oh hey, as far as I can tell, this is already done! @JAllemandou updated uap-java with the 0.6.9 version of uap-core on Sept 13th, and refinery-source was deployed with...
[20:21:12] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) production schemas & instrumentation schemas? ;
[20:32:55] ottomata: 102G?
[20:33:00] ottomata: wait
[20:33:20] (03CR) 10Mforns: "LGTM overall! Left 1 controversial comment :]" (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[20:33:34] ottomata: I am not running naything
[20:33:37] *anything
[20:34:05] ottomata: ah in space!
[20:39:00] ottomata: 27G now
[20:39:09] :)
[21:12:37] ottomata: what is the tool you use to do system design diagrams?
[21:17:50] nuria: i used lucidcharts
[21:18:09] https://www.lucidchart.com/
[21:18:19] it is free up to so many elements in the diagram
[21:18:29] dunno if it is good but it worked for me
[22:15:48] this is so odd ... i had an exception that looked like mismatched spark versions, so i swapped the oozie sharelib for spark2.4.4; that errors out with no SparkMain found. I had explicitly set SPARK_HOME=/usr/lib/spark2 to stay with spark2.3.1; removing SPARK_HOME, it now starts but can't find tables that exist in hive. fun :)
[22:16:27] ever
[22:16:28] spark-2.4.4
[22:16:28] ?
[22:16:31] for oozie sharelib?
[22:16:39] oozie admin -shareliblist
[22:16:58] ebernhardson: ^
[22:17:14] ottomata: mhm, i don't have the - in my workflow.xmls, i have oozie.action.sharelib.for.spark and spark2.3.1 with no dash, checking
[22:17:22] yeah, we changed it
[22:17:42] a long time ago we had both spark 1 and spark 2 installed
[22:17:43] ok lemme try spark-2.4.4, might work
[22:17:52] and so all the stuff was spark2-blabla
[22:17:54] might explain why no SparkMain at least :)
[22:17:57] to be consistent with that we made the sharelib be spark2.3.1
[22:18:02] which was a bad idea
[22:18:07] now we no longer have spark 1
[22:18:12] we just use the version as it should be
[22:20:33] 10Analytics, 10Desktop Improvements, 10Event-Platform, 10Readers-Web-Backlog (Kanbanana-2019-20-Q2): [SPIKE 8hrs] How will the changes to eventlogging affect desktop improvements - https://phabricator.wikimedia.org/T233824 (10Jdrewniak) 05Open→03Resolved Looks like this task has been thoroughly analyze...
[22:20:36] ottomata: might have done the trick, i see it picking up executors now
[22:21:25] nope :P
[22:32:21] ottomata: the error was the same in 2.3.1 and 2.4.4: failed finding a table in hive. Adding `--conf spark.sql.catalogImplementation=hive` to the spark invocation has it working now with 2.4.4 (didn't try 2.3.1)
[22:33:13] ottomata: that value is already set on stat1007 in /etc/spark2/conf/spark-defaults.conf, and i would imagine on the rest of the cluster, so not sure why i had to set that..
[22:40:31] (03CR) 10Nuria: Add hdfs-rsync script based on Hdfs python lib (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[22:54:35] hmm, that is strange. hm.
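To make the sharelib renaming discussed above concrete: the server-side sharelib id moved from the dash-less "spark2.3.1" convention to the plain version name, so workflows must reference it accordingly. A sketch of the relevant check and property (the `spark-2.4.4` value is an assumption taken from this conversation, not a verified cluster config):

```
# List the sharelibs the Oozie server actually offers (command from the log);
# the spark entry should appear with the new dashed name:
oozie admin -shareliblist

# job.properties fragment (illustrative) referencing it by that name:
oozie.action.sharelib.for.spark=spark-2.4.4
```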
[22:55:34] hm, so yeah the problem looks to be that spark-defaults.conf isn't being read, or at least that's part of the problem. It's not sourcing the dynamicAllocation conf either (looking in the ui).
[22:55:52] trying to put SPARK_HOME=/usr/lib/spark2 back in; that would get it to read /usr/lib/spark2/conf/spark-defaults.conf
[22:55:53] (03CR) 10Ottomata: Add hdfs-rsync script based on Hdfs python lib (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/550536 (https://phabricator.wikimedia.org/T234229) (owner: 10Joal)
[22:56:24] ebernhardson: strange though, i think our oozie stuff is working fine...
[22:56:24] hm
[22:56:30] i gotta run, let me know if i can help more tomorrow!
[22:56:33] sure
[23:08:49] PROBLEM - Check the last execution of hdfs-cleaner on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit hdfs-cleaner https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
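Until the spark-defaults.conf sourcing issue above is understood, ebernhardson's workaround amounts to pinning the unsourced defaults explicitly at submit time. A sketch of that invocation (the script name and the dynamicAllocation flag are illustrative assumptions; only the catalogImplementation conf is confirmed by the log):

```
# Pass the settings that spark-defaults.conf would normally supply
# directly on the command line, so the job does not depend on the
# defaults file being read in the Oozie launcher environment.
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.dynamicAllocation.enabled=true \
  my_spark_job.py   # hypothetical script name
```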