[00:49:14] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:06:55] goood morning [07:25:39] Good morning [07:30:42] bonjour! [07:30:49] I was looking at https://github.com/apache/spark/blob/branch-2.4/conf/metrics.properties.template [07:31:12] there is a "ConsoleSink" that could be interesting to test, it logs metrics every X seconds [07:31:20] Nice :) [07:32:09] spark 3.0 supports prometheus and jmx [07:32:26] Support for prometheus is a nice addon :) [07:32:49] we are basically flying blind, I checked the last failure and the jvm seems ooming while storing on the heap [07:33:11] but it is not clear to me if it is because fetching too much data from the shuffler, or something else [07:33:58] we can try with a beefier max heap for refine, like 8G [07:34:04] for the time being I mean [07:34:20] we can also test those metrics and see if they can help or if they spam [07:34:42] hum [07:36:45] elukey: we've done two changes with ottomata in the last patch: add the logic to be able to get malformed data, and add some caching [07:36:53] elukey: we can try to remove caching [07:37:58] The thing I don't understand though is: the jobs succeed when run one at a time, and fail when run in conjunction with others - This is a signal that the problem is related to having many jobs run in sequence [07:38:09] yep I agree [07:38:56] or it could be that a single job gets into a specific corner case, that causes some executors to fail [07:39:20] elukey: it would be multi-job related nonetheless [07:40:30] ? [07:40:38] elukey: arf sorry [07:41:05] elukey: I mean the problem is related to refining multiple datasets in a single spark instance [07:42:17] the main issue is that we can only speculate every time that an issue occurs, we don't really have any proof other than stacktraces (that are not really helpful) [07:42:33] it is a sign of missing metrics and instrumentation [07:44:11] elukey: I agree we're missing instrumentation - let's try the metrics [07:45:41] going to do some tests on hadoop test to see how it goes [07:46:06] Also elukey: the data of the failed job is not even big: 508Mb [07:46:24] I feel our problem is related to too many refined datasets in parallel [07:51:57] it seems like a good lead theory.. is it the parallel task setting that Andrew tuned? [07:52:05] elukey: yes [07:52:16] joal: we can try something like 32 if you want [07:52:53] elukey: the error is seldom enough that it'll be complicated to actually monitor :( [07:53:14] elukey: I'm drilling down in logs a bit, trying to make more sense [07:54:40] we can also use a very coarse grained approach, like how many refine failures we get..
the main goal for the moment is to reduce toil [07:54:54] namely us to re-run things on a daily basis [07:55:20] yup [07:55:22] hum [08:01:11] elukey: Refining 8 dataset partitions in into tables `event`.`Test`, `event`.`HelpPanel`, `event`.`TemplateWizard`, `event`.`ContentTranslationAbuseFilter`, `event`.`HomepageModule`, `event`.`NewcomerTask`, `event`.`SpecialInvestigate`, `event`.`SearchSatisfaction` with local parallelism of 24 [08:01:37] the parallelism level set in the job has no effect - spark uses max-xpu [08:03:18] lovely [08:04:38] And, the job successfully refined 7 datasets, and failed one [08:04:47] And, all the datasets are very small [08:04:58] less than 10k records [08:07:37] elukey: I'm gonna try to re-refine on test DB and monitor [08:07:56] -- Gauges ---------------------------------------------------------------------- [08:07:59] application_1608216118485_10852.25.ExternalShuffle.shuffle-client.usedDirectMemory value = 16777216 [08:08:02] application_1608216118485_10852.25.ExternalShuffle.shuffle-client.usedHeapMemory value = 16777216 [08:08:25] Nice :) [08:09:08] it is not the best of the world since it is basically a periodic dump of values in the logs, on grafana it would be so awesome [08:09:30] elukey: there is a statsd pluggin I think [08:10:10] elukey: memory has not been changed for refine_eventlogging_legacy - 64 workers (enough) but 2G executor memory [08:10:42] we could think about having a prometheus statsd exporter on all worker nodes [08:10:50] that is polled by the prometheus masters [08:11:50] joal: ah good catch! Let's make it bigger then [08:12:12] I'm also trying to replicate the error in my own DB [08:14:47] I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/655016 in the meantime [08:16:12] Works for me elukey [08:17:31] ack merging [08:18:21] !log raise default max executor heap size for Spark refine to 4G [08:18:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:19:17] 10Analytics: Kerberos principal for kharlan - https://phabricator.wikimedia.org/T271467 (10Peachey88) [08:20:56] 10Analytics, 10Beta-Cluster-Infrastructure, 10Event-Platform: [Cloud VPS alert] Puppet failure on deployment-schema-2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T271508 (10hashar) [08:22:21] wow [08:23:11] elukey: manual test uses 48 local parallelism MEH [08:24:33] I am disabling the hdfs-cleaner timer via puppet for few days [08:24:59] Interesting - event raw data is owned as analytics/analytics making it unreadable to users [08:25:29] elukey: this means the perm change you did yesterday for data-quality will not be deleted, right? [08:25:31] what raw data? 
I mean, what paths [08:25:46] joal: also the /tmp/DataFrameToDruid dir [08:25:58] Right elukey [08:27:01] elukey: hdfs dfs -ls /wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2021/01/07/20 [08:27:34] elukey: folders are correctly owned by analytics/analytics-privatedata-users [08:27:43] but the files are owned by analytics/an [08:27:49] analytics/analytics sorr [08:27:57] This is camus related [08:28:53] Indeed, same for webrequest [08:29:17] I'm gonna make my tests run with user analytics, but this is not optimal [08:29:52] joal: you can chown some files that you need and avoid the analytics user [08:29:55] if it is a problem [08:30:13] elukey: not a real problem - I'll chown my destination folder :) [08:30:39] the main issue is that we have been working without really checking permissions for ages [08:30:44] sigh :( [08:30:58] * joal whistle slowly looking to the skies [08:44:57] !log force restart of monitor_refine_eventlogging_legacy_failure_flags.service [08:45:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:46:09] !log force restart of check_webrequest_partitions.service on an-launcher1002 [08:46:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:46:32] elukey: I found this: https://issues.apache.org/jira/browse/YARN-4714 [08:47:07] elukey: do we have virtual memory check enabled? [08:48:19] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10Hydriz) [08:48:34] in this case joal I don't think that it is Yarn that kills the spark jvms, or I didn't find traces of that in the node manager logs [08:49:39] but I don't find it explicilty set in puppet [08:51:32] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10Hydriz) [08:51:40] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:53:04] RECOVERY - Check the last execution of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:56:52] elukey: I think it is actually - There would be no reason for the JVM to die otherwise, would it? 
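A quick way to answer the virtual-memory-check question above on a worker node, sketched under assumptions: the /etc/hadoop/conf and /var/log/hadoop-yarn paths are guesses rather than taken from this log, and YARN defaults both checks to enabled when the properties are absent.

    # is the check explicitly configured? (no match means YARN's defaults apply, which enable it)
    grep -B1 -A2 -E 'vmem-check-enabled|pmem-check-enabled' /etc/hadoop/conf/yarn-site.xml

    # if YARN killed a container for exceeding its limits, the NodeManager log says so explicitly
    sudo grep -r 'running beyond .* memory limits' /var/log/hadoop-yarn/ | tail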
[09:00:01] joal: if Xmx is set and the heap space is exhausted, and if -XX:die-on-oom (I don't recall exactly the naming) is set too then yes [09:03:10] 10Analytics: Add Spark metrics - https://phabricator.wikimedia.org/T271513 (10elukey) [09:03:57] opened --^ for the metrics [09:04:21] hm [09:04:43] Actually on my example the container got killed by yarn (logs say so) [09:05:56] and my latest test say: with local parallelism of 72 [09:06:02] * joal doesn't understand :( [09:07:37] ah ok if you see logs that yarn acted then definitely [09:08:00] I checked briefly this morning and I didn't find any trace, only the die on oom thing [09:11:05] elukey: my manual rerun of the failed hour succeeded, but with some memory error that spark recovered [09:12:48] joal: I re-ran it a while ago as well, didn't know you were working on it, hope that I didn't cause any interference :( [09:12:49] elukey: I hope that the move to 4G executor will sove our problem, but I still wonder how the thing is even possible (small data only) [09:13:10] elukey: I ran manual stuff on my own DB, no problemo [09:13:22] super okk [09:13:31] Thanks for the real rerun [09:13:57] let's add spark metrics asap and see what happens, I am confident that they'll give us some good info [09:14:09] aure elukey [09:14:15] s/a/s [09:14:33] now, the other fun and joy of the day, namely permissions :D [09:15:30] Ah I'm sorry, I can't work on that: permission denied [09:15:32] :D [09:15:35] ok - bad joke [09:15:47] * elukey cries in a corner [09:15:50] :D [09:17:21] ok - yesterday I got stuck elukey - Can we brainstorm on that? [09:18:02] sure gimme 2 mins to inject caffeine into my brain [09:51:35] aaand as always joseph was right [09:51:36] drwxr-x--- 2 hdfs hadoop 4096 Jan 8 09:50 current [09:51:38] :) [09:52:30] elukey: I've been wrong in many ways recently - I'm however glad we caught that problem before being stuck at deploy [09:54:08] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -eqsin- on alert1001 is CRITICAL: 125.5 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [09:54:20] aouch [09:55:18] of course, why don't we add joy to this week :D [09:55:58] only one host [09:57:10] seems timeouts to kafka-jumbo1001 [09:58:25] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10JAllemandou) Thanks for reaching @Hydriz - We indeed have issues publishing data in the past days, due to T271362 as you menti... 
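On the varnishkafka delivery errors just above, a hedged sketch of the kind of check-and-restart that follows (the unit name comes from the !log entry below; the journalctl filtering is only an assumption about where the timeout/delivery errors show up):

    # on the affected cache host (cp5001 here)
    sudo journalctl -u varnishkafka-webrequest --since '1 hour ago' | grep -iE 'timed? ?out|delivery|broker' | tail -n 20
    sudo systemctl restart varnishkafka-webrequest
    sudo systemctl status varnishkafka-webrequest --no-pager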
[10:01:08] !log restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka seems not recovering very well [10:01:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:03:40] RECOVERY - cache_upload: Varnishkafka webrequest Delivery Errors per second -eqsin- on alert1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [10:19:10] tested my patch joal [10:19:11] drwxr-xr-x 14 hdfs hadoop 4096 Jan 8 10:18 current [10:19:21] \o/ [10:19:45] all my tests are failing for the moment elukey - I htink I know why, and it seems a bug [10:19:55] I am also comparing files inside a prev snapshot, it looks good [10:21:14] if you are ok I am going to merge https://gerrit.wikimedia.org/r/c/analytics/refinery/+/654833 [10:22:13] (03CR) 10Joal: [C: 03+1] "LGTM - Thanks a lot elukey" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [10:27:31] (03CR) 10Elukey: [V: 03+2 C: 03+2] "I have tested this on Hadoop test, and compared file permissions with a previous snapshot, everything looks good. Trace of execution (I de" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [10:44:35] (03PS4) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:45:41] (03PS5) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:46:43] (03PS6) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:58:53] elukey: --^ this has been tested successfully on hive - The only case not verified is the spark one for mediawiki-history-dumps [10:59:43] (03Abandoned) 10Joal: Update perms of oozie jobs writing public archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654904 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [11:02:05] joal: one qs - is hdfs dfs -dfs.permissions.umask-mode=$UMASK -mkdir -p $HDFS_DIR_ABS right? [11:02:20] a -D seems missing [11:02:24] elukey: I corrected that but probably forgot to push [11:02:34] ahhh okok super [11:02:53] (03PS7) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [11:02:58] thanks for checking elukey ! [11:04:26] will do another review in a bit! thank you! [11:04:39] elukey: in the meantime I'm testing the spark one [11:40:51] going afk for lunch ttl! [12:49:50] heyyy team! [13:02:44] morning team :) [13:19:41] (03PS1) 10Gehel: Adding Maven wrapper to conform to how other projects are built. [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 [13:28:42] (03PS2) 10Gehel: Adding Maven wrapper to conform to how other projects are built. [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 [13:34:51] (03PS3) 10Gehel: Adding Maven wrapper to conform to how other projects are built. 
[analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 (https://phabricator.wikimedia.org/T271541) [14:18:09] hola mforns and fdans [14:18:15] hola hola [14:18:42] mforns: I am about to create two tasks for the things that we patched yesterday, with some summary etc.. [14:18:54] so we can figure out how to proceed [14:19:13] elukey: ok, do you want to hear about yesterday's backfillings? [14:20:28] I have a question to ask [14:20:35] mforns: nono, did everything go fine or something explode? [14:20:36] ah okok [14:20:38] (03PS1) 10Elukey: Drop Debian 9 Stretch support [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 [14:20:40] sure do you want to bc? [14:20:40] (03PS1) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 [14:20:42] (03PS1) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655069 [14:20:50] ta daaan --^ [14:20:58] elukey: no, let's do it here, it's silent friday [14:21:19] mforns: I am fine to meet, but I can understand if you don't want to see me :( [14:21:29] * elukey cries in a corner [14:21:39] elukey: xDDDD, ok, let's meet in bc! [14:21:42] omw [14:21:45] super [14:27:18] \o/ I confirm the spark dumps job works as well [14:27:48] wooooowwww [14:28:03] elukey: with your permission we can merge my patch above above ( ottomata: can you confirm it's ok for you?) [14:28:37] I need to drop again, but will deploy and restart jobs when back for real (+2h30) [14:30:37] joal: I am going to review the change before you come back! [14:30:44] <3 [14:31:00] mforns: your opnion is very much welcoime as well if you have time :) [14:31:29] ok joal will look! [14:42:34] (03CR) 10Mforns: [C: 03+1] "LGTM! I left 2 minor comments on the script file. Feel free to ignore, though." 
(032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [14:47:49] (03PS1) 10Gerrit maintenance bot: Add in-tech.wikimedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655078 (https://phabricator.wikimedia.org/T271539) [14:50:38] (03CR) 10Joal: [V: 03+2] "Thanks @mforns for the review" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [14:51:00] (03PS8) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [15:08:54] (03CR) 10Elukey: [C: 03+1] "LGTM, I left a comment about group read vs read+execute perms, but really not a problem if everything works fine, we can change it later i" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [15:21:21] (03CR) 10Ottomata: [C: 03+1] Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [15:22:36] (03CR) 10Elukey: [C: 04-1] "Nope this was not intended in this way, sigh, too many things committed, will abandon and re-do" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 (owner: 10Elukey) [15:22:41] (03Abandoned) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 (owner: 10Elukey) [15:22:53] (03Abandoned) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655069 (owner: 10Elukey) [15:28:16] (03PS1) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655087 [15:28:18] (03PS1) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655088 [15:29:20] (03Abandoned) 10Ottomata: Bump up refinery-source version to 0.0.143 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654476 (owner: 10Razzi) [15:29:27] (03CR) 10Ottomata: [C: 03+1] Drop Debian 9 Stretch support [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 (owner: 10Elukey) [15:29:41] (03CR) 10Ottomata: [C: 03+1] "We'll be abandoning this repo soon anyway :)" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 (owner: 10Elukey) [15:30:30] (03CR) 10Elukey: "I tried the create_virtual_env script on Docker and it worked fine, but this will probably need to be tested a bit in hadoop test before p" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655088 (owner: 10Elukey) [15:34:50] ottomata: the privatetmp thing works on stat1004 :) [15:35:32] ottomata: what plans do you have for the jupyterhub repo? I wanted to upgrade some deps for security fixes, but if it is to be trashed I'll avoid [15:36:36] elukey: i'm testing the tmp thing on stat1006 [15:36:44] triied to run spark from a notebook after kiniting on ssh terminal [15:36:45] Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /tmp) after 10 attempts! 
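The exception above is what the sandboxed notebook unit produces when it cannot create Spark temp dirs under /tmp. One rough way to confirm what a spawned unit actually got; the jupyter-<user>-singleuser name is an assumed systemdspawner default, not taken from this log:

    # transient units still answer to systemctl show; replace <user> with the notebook owner
    systemctl show 'jupyter-<user>-singleuser' -p PrivateTmp -p ReadOnlyPaths
    # after the puppet fix one would expect /tmp to be writable again for new units;
    # already-running servers still need a stop/start from the hub to pick up the new unit properties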
[15:37:00] oh [15:37:08] i think its also in the jupyter config lemme see [15:37:18] I am running puppet now on 1006 :) [15:37:33] ah it was already applied [15:37:49] # We allow users to only write to their homedirs and /dev/shm, and nowhere else [15:37:49] c.SystemdSpawner.readonly_paths = ['/'] [15:37:55] pathc oming [15:38:26] ah didn't see it thanks [15:38:33] elukey: newpyter doesn't use the jupyterhub deploy repo [15:38:48] it just runs jupyterhub out of anaconda-wmf [15:38:51] so its all in the deb package [15:39:00] ah right didn't know this bit [15:39:05] do you prefer me to abandon then? [15:39:31] no no [15:40:03] oh sorry [15:40:06] the jupyterhub one [15:40:10] sorry too many things at once [15:40:13] i dunno, y ou can remove it [15:40:15] doesn' tmatter i guess [15:40:26] no I mean the updates for the deps [15:40:36] systemd spawner, jupyterhub, etc.. [15:41:28] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/655090 [15:41:33] +1ed [15:42:17] elukey: if you want to upgrade deps for sec fixes [15:42:19] do it in anaconda-wmf [15:42:36] and we can try to get a new .deb with them included out in a week or two [15:42:38] that it is a blackhole for me, so good opportunity to learn, if you are ok yes :) [15:43:03] yaya its simple [15:43:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/anaconda-wmf/+/refs/heads/debian/README.debian.md [15:43:59] and extra deps and versions can be specified in [15:44:00] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/anaconda-wmf/+/refs/heads/debian/extra/ [15:44:07] conda-requirements if the things has a conda package [15:44:09] otherwise pip requirements [15:44:22] jupyterhub-systemdspawner [15:44:22] is in conda-requirements [15:44:31] feel free to add versions there [15:45:47] OH oops elukey i fixed the tmp thing in only the old swap jupyterhub :/ [15:50:33] ah fixing it for the other one too! SEnding cr [15:50:41] already did! [15:50:58] then you are super [15:51:05] all working now on stat1006? [15:51:14] ELUKEY IT WORKS! [15:51:14] yes [15:51:39] much better thank you [15:52:23] all right I am going to send an email to announce@, people will need to shutdown their notebook to get the new configs [15:52:32] but all the new ones should be able to work fine [15:54:43] actually, puppet is restarting jupyterhub [15:54:55] so their servers will need to be restarted i think the next time they log in [15:55:00] it will prompt them (it did me) [15:57:04] in theory no, if jupyterhub gets restarted the notebook should keep running, it is a separate systemd unit (at least this is my understanding) [15:57:30] they need to shutdown the notebook since the ephemeral systemd unit will vanish and a new one (without PrivateTmp) will be created [15:57:33] does it make sense? [15:59:08] elukey: it does, but my server was stopped after the puppet run [15:59:33] very strange [16:08:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Kerberos credential cache location - https://phabricator.wikimedia.org/T255262 (10elukey) >>! In T255262#6731979, @gerritbot wrote: > Change 650480 **abandoned** by Elukey: > [operations/puppet@production] profile::kerberos::client: change alterna... [16:19:18] razzi: goood morning! 
When you have time let's nuke analytics-tool1004 :D [16:19:31] 10Analytics, 10Research: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) Thanks @Milimetric -- @bmansurov will be leading the technical work on this so we're going to start work on this and greatly appreciate whatever code re... [16:22:16] elukey: good morning, yeah, I'll decomm! [16:24:15] nice! [16:30:50] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `analytics-tool1004.eqiad.wmnet` - analytics-tool1004.eqiad.wmnet (**WARN**) - **Failed... [16:38:11] razzi: what is the error msg that you get? I am curious [16:38:36] ```Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox [16:38:36] Generating the DNS records from Netbox data. It will take a couple of minutes. [16:38:36] ----- OUTPUT of 'cd /tmp && runus...e asset tag one"' ----- [16:38:36] 2021-01-08 16:28:28,779 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox [16:38:36] 2021-01-08 16:30:44,827 [ERROR] Failed to run [16:38:36] Traceback (most recent call last): [16:38:36] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 687, in main [16:38:37] batch_status, ret_code = run_commit(args, config, tmpdir) [16:38:37] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 591, in run_commit [16:38:38] netbox.collect() [16:38:38] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect [16:38:39] self._collect_device(device, True) [16:38:39] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 198, in _collect_device [16:38:40] if self.addresses[primary.id].dns_name: [16:39:36] mmmm weird [16:39:50] can you add this to the task? So we can ping Riccardo next week in case [16:39:58] the full stack trace I mean [16:40:21] elukey: sounds good yeah [16:40:33] it is weird since the dns record is still active [16:41:06] but https://netbox.wikimedia.org/search/?q=analytics-tool1004 is empty in netbox, probably already wiped [16:41:15] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10razzi) Here's the error from attempting to decommission analytics-tool1004: ` Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox Generating the DNS rec... [16:41:25] I am wondering if puppet disabled played a role [16:42:22] anyway, razzi, can you execute the cookbook for netbox dns to see if it prompts a diff for analytics-tool1004? 
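A sketch of the two checks being discussed here: the dig lookup is an assumption about how one would confirm the record elukey says is still active, and the dry-run cookbook invocation mirrors the command quoted just below.

    # is the A record still there even though Netbox no longer lists the host?
    dig +short analytics-tool1004.eqiad.wmnet

    # dry-run of the Netbox DNS cookbook: shows the pending diff without committing anything
    sudo cookbook -d sre.dns.netbox "test"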
[16:43:14] yup [16:43:17] https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:43:30] you can run sudo cookbook -d sre.dns.netbox "test" [16:43:49] I see a warning in icinga but from 30 mins ago, there might be more stuff pending [16:43:59] let's use the test run to be sure [16:44:02] razzi: --^ [16:46:58] elukey: I wasn't familiar with the "test" option and kicked it off like so: `sudo cookbook sre.dns.netbox -t T268219 'Decommissioned analytics-tool1004.eqiad.wmnet'` [16:46:59] T268219: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 [16:48:14] razzi: ah okok all good, but please wait if the diff shows more records that analytics-tool1004's [16:49:49] elukey: looks good, diff only has analytics-tool1004 [16:50:06] super, lets do it [16:57:24] 10Analytics: Follow up on /tmp/DataFrameToDruid permissions after umask change - https://phabricator.wikimedia.org/T271558 (10elukey) [16:59:41] (03CR) 10Joal: Update permissions of oozie jobs writing archives (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [17:00:45] (03PS9) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [17:03:11] 10Analytics: Follow up on /tmp/analytics permissions after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (10elukey) [17:04:20] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Xaosflux) [17:07:04] 10Analytics, 10Analytics-Kanban, 10Dumps-Generation, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Bstorm) [17:16:08] 10Analytics, 10Analytics-Kanban, 10Dumps-Generation, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Bstorm) I suspect this is related to https://lists.wikimedia.org/pipermail... [17:37:54] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10Ladsgroup) [17:45:19] heya folks [17:45:49] elukey: I know it's late - Asking permission (about persmissions) to deploy [17:45:52] 10Analytics: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues - https://phabricator.wikimedia.org/T271568 (10elukey) [17:46:12] joal: +1 :) [17:46:20] Starting the process [17:46:42] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for dpeloy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [17:47:32] razzi: I added subtask to https://phabricator.wikimedia.org/T270629, basically all the things that we did with Marcel yesterday [17:47:39] during the convo-debugging :) [17:49:16] elukey: nice, thanks for documenting [17:49:53] razzi: if you have questions we can follow up on meet [17:50:15] !log deploy refinery with scap [17:50:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:58:07] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy next week." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/655078 (https://phabricator.wikimedia.org/T271539) (owner: 10Gerrit maintenance bot) [18:03:59] !log Deploy refinery onto HDFS [18:04:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:06:39] * elukey drumrolls [18:06:52] :) [18:09:28] elukey: if you're still here, I think I got why the druid ingestion failed on the 3rd, it was us stopping systemd timers in an-launcher1002 for a couple hours. When we were looking at the time interval earlier today, we didn't account for the ingestion delay of 4-5 hours... This hole of the 6th is going to be covered during the weekend by the daily ingestion job as well, so I think we're good. [18:10:24] elukey: deploy worked - perms are good :) [18:10:30] * elukey dances [18:10:43] \o/ [18:10:45] elukey: I'm now gonna restart one oozie job and check [18:10:47] mforns: ahhhh yes yes it makes sense! [18:10:56] thanks a lot for checking <3 [18:11:00] np! [18:14:31] !log Restart projectview-hourly job (permissions test) [18:14:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:24] 10Analytics, 10Product-Analytics: Update Image usage metric - https://phabricator.wikimedia.org/T271571 (10Isaac) [18:35:58] razzi: just fyi, explained to Luca what we discovered yesterday, and he was OK with letting the daily loading automatically backfill the holes during the week-end. So, we will do that :] By Monday, all datasources should be fine. [18:54:12] !log Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly]) [18:54:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:13:11] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10elukey) We are in the process of fixing permissions for files, and restart the rsync jobs that pull data from HDFS to the labstore nodes (serving dumps.wiki... [20:14:41] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) p:05Triage→03Medium a:03colewhite Prometheus-statsd-exporter sits in front of statsv and relays the statsd traffic it recieves to graphite. [[https://grafan... [20:15:42] * elukey afk! [20:17:33] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) [20:17:38] Same - gone for tonight - Have a good weekend team [20:26:00] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (033 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [20:36:34] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [21:03:43] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10Ladsgroup) oh I see. Thanks! 
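Circling back to the permissions work deployed above, a minimal sketch of the corrected mkdir form discussed at 11:02 (the -D that was missing) plus a spot check after the job restarts; the archive path is a placeholder, not taken from the log:

    # -D passes the umask as a Hadoop property for this single command
    hdfs dfs -D dfs.permissions.umask-mode="$UMASK" -mkdir -p "$HDFS_DIR_ABS"

    # spot-check group/other bits on one of the restarted jobs' archive output (placeholder path)
    hdfs dfs -ls /wmf/data/archive/<dataset>/ | head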
[21:08:14] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) 05Open→03Resolved [21:09:55] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [21:23:53] Hi, I'm looking at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest which mentions "wmf.webrequest" tables etc. I'm logged into Hive on stat1007 and I'm in the "default" DB. [21:24:11] "show tables;" does not list "wmf.webrequest". Do I first need to create my own DB? Or do I need some special permissions? Or are the docs wrong? I'm obviously clueless. [21:29:43] Errrm, okay that wasn't my brightest moment and questions ever when it comes to reading ^. Still: [21:29:49] hive (wmf)> SELECT * FROM webrequest LIMIT 1; [21:29:49] FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" [21:31:40] Ah, that's https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Usage_notes - alright, I should read instead of ridiculing myself more here I guess. Cheers & good weekend! :) [21:48:30] (03PS18) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [21:50:48] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (035 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:02:04] (03CR) 10Milimetric: [C: 03+2] Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: 10Fdans) [22:04:01] (03Merged) 10jenkins-bot: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: 10Fdans) [22:05:01] (03PS12) 10Milimetric: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:05:20] (03CR) 10Milimetric: [C: 03+2] Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:06:26] (03Merged) 10jenkins-bot: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:12:09] (03PS10) 10Milimetric: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:12:16] (03CR) 10Milimetric: [C: 03+2] Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:13:19] (03Merged) 10jenkins-bot: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:39:46] (03CR) 10Bstorm: "This latest patch set does seem to work fine locally. Naturally it might be a bit different on quarry itself. 
There is some kind of dev se" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [23:51:11] RECOVERY - Check the last execution of analytics-dumps-fetch-pageview on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
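For the wmf.webrequest question from earlier (21:23-21:31): the table is partitioned and strict mode rejects queries without a partition predicate, so the partition columns need to be constrained. A minimal sketch, assuming the documented webrequest_source/year/month/day/hour partitioning and an arbitrary example hour:

    hive (wmf)> SELECT uri_host, uri_path FROM webrequest WHERE webrequest_source='text' AND year=2021 AND month=1 AND day=8 AND hour=0 LIMIT 1;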