[02:30:00] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), and 3 others: Spin out a tiny EventLogging RL module for lightweight logging - https://phabricator.wikimedia.org/T187207 (10Krinkle) 05Open→03Resolved @Milimetric My test is with (a simpl...
[06:55:29] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:55:39] good morning!
[06:55:42] -.-
[06:58:48] dsaez: o/
[06:59:03] hi elukey
[06:59:09] here is a python3 script a bit too hungry of memory on notebook1003
[06:59:15] *there
[06:59:21] yes, sorry I'll clean now
[06:59:45] thank youuuu
[07:17:44] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey once you've transferred all the files to the definite location, this t...
[07:18:22] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[08:43:36] joal: o/
[08:43:38] bonjour
[08:43:59] when you are online I'd need to copy some dbstore1002 files to hdfs
[08:44:30] I thought about /wmf/data/archive/backup/misc/dbstore1002_backup
[08:44:47] (essentially some .sql files with database dumps)
[08:44:53] what do you think?
[08:45:27] (we are backing up some oooold databases that are not going to be migrated to the new dbstore env)
[08:54:08] in the meantime, I merged the change to migrated the reportupdater hdfs jobs to timers and of course I made a mistake
[08:54:11] fixing
[09:19:54] so the interlanguage job kept failing even before the timer
[09:21:47] that is basically https://phabricator.wikimedia.org/T213219
[09:34:01] !log removed ./jobs/limn-language-data/interlanguage/.reportupdater.pid in /srv/reportupdater on stat1007 to force the first run of the timer
[09:34:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:34:26] so empty pid file
[09:35:30] will ask to milimetric moar info, I am seeing that the task to alert of empty pid file has been declined
[09:38:21] buuut it looks good :)
[09:38:32] today we should be able to move report update to timers \o/
[09:38:45] then we have spark jobs
[09:43:00] going afk for a bit, cat to the vet :)
[10:28:26] helloo elukey just got in, is there anything I should do about these alerts? :)
[10:30:58] fdans: hola! Which ones?
[10:31:13] the ones from today?
[10:31:33] notebook1003 is ok, the other would need to be checke
[10:31:35] *checked
[10:31:48] the other is the dataloss one?
[10:32:44] yes exactly
[10:32:51] there are only two right?
[10:57:11] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is OK: OK
[12:26:32] Hi elukey - Sorry late start for me today -
[12:27:24] elukey: copying files to HDFS is completely fine for me :)
[12:27:36] And the path looks ok
[12:28:38] hello joal!
[12:28:46] I have been busy merging the admin puppet change
[12:29:03] if you give me the green light I'd remove from test+production the unnecessary ssh keys from the Hadoop masters
[12:29:06] (finally)
[12:29:28] hm - ssh keys from gone people?
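(Editor's note on the pid-file removal logged above: reportupdater uses a .reportupdater.pid file as a run guard, and a leftover file — here an empty one — blocks the next run, which is why deleting it was needed to let the first timer run go through. Below is a minimal, hedged sketch of the kind of manual check that makes such a deletion safe; the path comes from the !log entry, but the "is the recorded pid still alive" logic is an assumption about how the guard behaves, not reportupdater's actual code.)

    # Hypothetical manual check of a leftover reportupdater pid file (path from the log above)
    PIDFILE=/srv/reportupdater/jobs/limn-language-data/interlanguage/.reportupdater.pid

    if [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "reportupdater still running as pid $(cat "$PIDFILE"); leaving the file alone"
    else
        # Empty or stale pid file: remove it so the next (timer-driven) run can start
        sudo rm -v "$PIDFILE"
    fi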
[12:31:12] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10elukey) [12:31:37] joal: no no, ssh keys from people that are not listed as admns [12:31:39] *admins [12:31:57] Ah ok [12:32:08] so we'll still map users to hdfs [12:32:13] but they will not be able to ssh [12:32:18] to the hadoop masters [12:32:22] ok [12:32:34] going to test it on the testing cluster now [12:32:48] this is the groundwork to then allow us to deploy users on all the workers [12:32:57] to use linux containers with yarn [12:33:15] Makes sense elukey [12:33:20] +1 ! [12:33:35] super, proceeding :) [12:41:06] joal: worked like a charm on testing [12:48:20] also done in prod: ) [12:48:22] gooood [12:48:25] now lunch :) [12:48:51] Buon apetitio elukey :) [12:48:56] oops -i [13:02:43] (03CR) 10Joal: "Two comments inline - I confirm the newly computed snapshot doesn't have the revision-duplication issue." (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal) [13:53:25] fdans: I'm currently on data quality - Do you want to discuss in da cave? [13:53:43] joal: yesss [13:55:09] all the report updater jobs have been migrated to timers \o/ [13:55:21] * joal clap clap clap ! [14:02:35] now last steps are the spark jobs [14:18:42] YAY, puppet provided users with no ssh keys :D [14:19:07] \o/ [14:21:30] joal: if you have time [14:21:33] sudo -u hdfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/ [14:21:38] from my home dir on stat1007 [14:21:42] does it look reasonable? [14:22:45] elukey: sudo -u hdfs hdfs dfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/ [14:22:54] elukey: sounds good :) [14:23:04] elukey: parent folder exists, so good [14:23:20] ah snap dfs yes [14:23:21] thanks :) [14:23:39] elukey: hdfs dfs actually [14:23:43] yep yep [14:23:58] for some reason my brain types it correctly in some cases [14:24:01] and forgets in others [14:24:09] :) [14:24:10] LRU cache buggy [14:24:28] I usually forget the "hdfs" bit when sudoing as [14:24:40] As if the username was the command [14:25:58] worked! [14:26:05] great :) [14:28:01] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) Done! ` elukey@stat1007:/srv/home/elukey$ sudo -u hdfs hdfs dfs -ls /wmf/data/arc... [14:28:34] 10Analytics, 10Analytics-Kanban, 10DBA, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) [14:36:01] (03CR) 10Nuria: [V: 03+2 C: 03+2] Add VisualEditorFeatureUse schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/485354 (https://phabricator.wikimedia.org/T212588) (owner: 10Neil P. Quinn-WMF) [14:43:01] * elukey afk for a bit [14:49:08] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (10Nuria) @TheSandDoctor all files are compressed and thus much smaller than what you get once you uncompress them and tried to use them. Please take a look at: https://meta.wikimedia.org/wiki/Data... 
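(Editor's note: the backup copy discussed above boils down to a handful of `hdfs dfs` subcommands run as the hdfs superuser — the first attempt just dropped the `hdfs dfs` part. A sketch of the full sequence, assuming the target directory might still need creating; the paths are the ones from the log, the rest is a generic illustration.)

    # Copy the dbstore1002 dumps from the local home dir on stat1007 into HDFS.
    # In the log the parent directory already existed, so -mkdir -p is only a safety net.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/data/archive/backup/misc
    sudo -u hdfs hdfs dfs -copyFromLocal dbstore1002_backup /wmf/data/archive/backup/misc/

    # Sanity checks: list the copied files and their sizes on HDFS
    sudo -u hdfs hdfs dfs -ls /wmf/data/archive/backup/misc/dbstore1002_backup
    sudo -u hdfs hdfs dfs -du -h /wmf/data/archive/backup/misc/dbstore1002_backup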
[14:49:16] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (10Nuria) 05Open→03Declined
[14:56:28] elukey: yeah we decided the pid problem was just a weird one-off and it's safer to fail when we see it. The improved reporting of errors should help, as it did with the interlanguage job, thanks for fixing that
[15:01:21] joal: thanks for checking the snapshot, it looks like you were right!
[15:01:27] though I'm now scared about what that means :)
[15:01:43] I'm technically off today, though I'm still not sure if I should work, Stephanie's being strange
[15:15:01] milimetric: o/ - sure, maybe the error msg should be a bit more verbose, it took me some minute to figure out the problem
[15:15:44] true, that'd be an improvement
[15:20:33] (03PS1) 10Milimetric: Clarify pid file error message [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/485673 (https://phabricator.wikimedia.org/T213219)
[15:20:55] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Clarify pid file error message [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/485673 (https://phabricator.wikimedia.org/T213219) (owner: 10Milimetric)
[15:21:30] k, that's merged, can be deployed next time we deploy RU
[15:34:47] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings: event_pageissues Turnilo view contains no valid data from before January 5 - https://phabricator.wikimedia.org/T214136 (10mforns) > I see - does that also explain the "No data was returned" error in Superset? I'm not sure, it fails even with latest data...
[15:38:34] mforns: o/
[15:38:45] heya Lucaaa
[15:39:06] holaaa
[15:39:25] mforns: I'd like to move sanitize_el_analytics to timers if you are ok (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483426/)
[15:39:40] is there any special thing to do?
[15:39:53] I am basically replacing the crons with the timers
[15:40:07] this is the first spark job migrated
[15:40:16] if it turns out to be ok I'll migrate the rest
[15:43:48] yes, I've seen that, thanks! It looks good to me
[15:43:59] is the 4:14 an easter egg?
[15:47:22] mforns: 4:14?
[15:47:42] elukey, yes there's a job that starts daily at 4:14h
[15:48:44] it surprised me, because of the minute 14, why not just 4:00h, and I thought it was an easter egg, like bible passage or sth :]
[15:48:56] mforns: ah yes! Not sure why there is 14, good point, it should be 4:15
[15:49:04] basically I took the cron timings from the lines below
[15:49:08] hehe
[15:49:11] ok
[15:49:43] fixed on the fly :)
[15:50:20] basically this
[15:50:21] # Puppet Name: monitor_sanitize_eventlogging_analytics
[15:50:21] 15 4 * * * /usr/local/bin/monitor_sanitize_eventlogging_analytics >> /var/log/refinery/monitor_sanitize_eventlogging_analytics.log 2>&1
[15:50:32] is it ok right?
[15:51:00] because in this case the cron will be absented and the timer (even for the refine monitor) will be added
[15:53:06] mforns: --^
[15:53:40] in any case, would you have 5/10 mins to check with me after the merge that nothing is exploding?
[15:56:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Refactor analytics cronjobs to alarm on failure reliably - https://phabricator.wikimedia.org/T172532 (10elukey)
[15:59:51] elukey, sure!
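(Editor's note: for reference, the cron entry quoted above maps onto a systemd timer/service pair roughly as sketched below. The unit names and layout are hypothetical — the real units are generated by Puppet — the point is only the schedule translation: "15 4 * * *" becomes OnCalendar at 04:15 daily, and the >> log redirection becomes journald capture.)

    # Hypothetical units; names, sections shown, and paths are illustrative only.
    # Cron:  15 4 * * *  /usr/local/bin/monitor_sanitize_eventlogging_analytics >> /var/log/refinery/... 2>&1

    # monitor_sanitize_eventlogging_analytics.timer
    [Timer]
    OnCalendar=*-*-* 04:15:00

    # monitor_sanitize_eventlogging_analytics.service
    [Service]
    ExecStart=/usr/local/bin/monitor_sanitize_eventlogging_analytics

One practical difference from cron is that stdout/stderr end up in the journal rather than in the redirected log file, which is also why the journalctl commands discussed later in this log become the way to read the output.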
[16:31:16] mforns: done1
[16:31:23] ok
[16:32:34] 10Analytics, 10Analytics-Kanban: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (10Milimetric)
[16:33:53] 10Analytics: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric) p:05Triage→03Normal
[16:34:51] 10Analytics: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric) a:03elukey
[16:35:08] 10Analytics, 10Analytics-Kanban: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10Milimetric)
[16:37:16] 10Analytics, 10Analytics-Kanban: Move AQS to nodejs 10 - https://phabricator.wikimedia.org/T210706 (10Milimetric) a:03Milimetric
[16:38:10] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10Milimetric) a:05elukey→03Ottomata
[16:38:21] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10Milimetric) switching to Andrew per Luca
[16:42:18] mforns: from my side is all green, let me know if you see anything weird
[16:42:25] elukey, ok
[16:47:26] 10Analytics, 10Analytics-Kanban: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10elukey) Hey Aaron, I can see the following: ` elukey@stat1007:~$ apt-cache policy git-lfs git-lfs: Installed: 2.3.4-1 Candidate: 2.3.4-1 Version table: 2.6.1-1~bpo9+1 100 100 http:/...
[16:47:44] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (10elukey)
[17:02:05] 10Analytics, 10Analytics-EventLogging, 10Discovery, 10EventBus, 10Services (watching): Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 (10JAllemandou) I had in mind that one of the reason for using avro originally...
[17:05:18] Jan 21 17:00:01 an-coord1001 systemd[1]: Started Spark job for sanitize_eventlogging_analytics.
[17:05:21] Jan 21 17:00:01 an-coord1001 systemd[1]: sanitize_eventlogging_analytics.service: Main process exited, code=exited, status=203/EXEC
[17:05:24] mforns: --^
[17:05:26] no bueno :D
[17:05:33] mmmmm
[17:05:40] lookin
[17:05:55] I think it is a misconfiguration of the timer iselft
[17:08:04] ahhh snap
[17:08:11] the script doesn't have a shebang
[17:09:05] yeah
[17:09:10] found it mforns
[17:09:25] oh
[17:10:38] sending the patch now
[17:14:04] mforns: https://gerrit.wikimedia.org/r/485689
[17:14:11] lookin
[17:14:50] +1'ed
[17:14:54] thanks!
[17:14:55] merging
[17:14:59] the rest looks good afaics
[17:15:00] is that needed for timers?
[17:15:05] ok
[17:19:28] mforns: yeah basically it seems that systemd does not like anything in ExecStart (basically what to execute) that it is not explicit
[17:19:36] in this case, a script without the shebang
[17:19:41] aha
[17:19:46] didn't know it
[17:19:52] if you want to play around
[17:19:56] on an-coord you can type
[17:20:13] systemctl status sanitize_eventlogging_analytics.service to see how the script is doing
[17:20:22] now it is not executing since it completed
[17:20:30] but you can see (code=exited, status=0/SUCCESS)
[17:20:36] that is the last execution
[17:20:42] it's dead
[17:20:46] ok ok
[17:20:54] well it already finished
[17:21:01] cooool
[17:21:11] logs are in journalctl -u sanitize_eventlogging_analytics
[17:21:18] (with sudo)
[17:21:25] where they previously logged anywhere?
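(Editor's note: status=203/EXEC above is systemd reporting that exec() of the ExecStart command failed before the program even started — typically a script with no interpreter line or no execute bit. Cron entries run through a shell, so shebang-less scripts go unnoticed there. A hedged sketch of the fix and of the inspection commands quoted in the log; the script path is illustrative, while the systemctl/journalctl invocations are the ones mentioned above.)

    # The fix: make sure the wrapper script invoked by ExecStart begins with an
    # interpreter line (e.g. "#!/bin/bash" as its very first line) and is executable;
    # otherwise the kernel cannot exec() it directly and systemd reports 203/EXEC.
    head -1 /usr/local/bin/sanitize_eventlogging_analytics    # should print the shebang (path is hypothetical)
    chmod +x /usr/local/bin/sanitize_eventlogging_analytics   # missing exec bit is the other usual culprit

    # Checking the unit afterwards (commands from the log, run on an-coord1001):
    systemctl status sanitize_eventlogging_analytics.service   # last run, exit status
    sudo journalctl -u sanitize_eventlogging_analytics         # spark-submit output captured by journald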
[17:21:36] I might have missed it
[17:21:37] awesome Luca, I had this thing in my todo for many days now
[17:22:03] elukey, well the real logs I get using yarn logs
[17:22:32] even the driver ones, because this job is called with deploy-mode=cluster
[17:23:26] ack, we get some know in journald about spark-submit
[17:23:44] if everything goes well I'll migrate the rest :)
[17:23:45] k
[17:36:54] * elukey off!
[19:40:26] a-team, can you please look at my off-site email and respond until tomorrow's stand-up? thxxxxx
[21:24:13] (03PS1) 10Joal: Update delete/restore in mediawiki-history [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/485710 (https://phabricator.wikimedia.org/T213603)
[21:24:17] milimetric: --^
[21:24:36] milimetric: Please don't hesitate if the changes don't make sense :)
[21:24:41] joal: I'm at a total loss, where in the world did the quality job put the results?
[21:24:58] isn't it supposed to be in history_check_errors
[21:25:05] milimetric: youre talking about the mediawiki-checker right?
[21:25:20] milimetric: it only writes output in case of error :)
[21:25:25] yeah, I ran it, it said 93 errors out of 5463 and no output
[21:25:30] milimetric: no error - no output :)
[21:25:41] oh it's below the threshold!
[21:25:42] Ah - errors ?? hm
[21:25:45] ok, so I guess it's good
[21:26:15] checking milimetric
[21:26:24] 19/01/21 17:33:18 INFO DenormalizedHistoryChecker: DenormMetricsGrowthErrors ratio: (93 / 5463) = 0.017023613399231193
[21:26:37] (i'll do your code review)
[21:26:56] ok, joal, so I'll make the documentation tweak you suggested and besides that I think we're good to deploy, right?
[21:27:07] let's do it together tomorrow maybe
[21:27:07] +1 milimetri !
[21:27:11] Sounds good :)
[21:27:17] ok, cool, good night for now
[21:27:38] Bye ;)
[21:27:59] Gone for now team - see y'all tomorrow
[22:20:38] (03CR) 10Nuria: Update mediawiki-history comment and actor joins (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:31:38] (03PS9) 10Milimetric: Update mediawiki-history comment and actor joins [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:31:49] (03CR) 10Milimetric: Update mediawiki-history comment and actor joins (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[22:56:59] (03CR) 10Nuria: Update mediawiki-history comment and actor joins (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: 10Joal)
[23:34:18] milimetric: i think these docs are quite outdated: https://www.mediawiki.org/wiki/Extension:EventLogging#Using_mediawiki-vagrant but I also think i saw a page where you updated this info, right?
[23:42:32] milimetric: nevermind found ticket and corrected docs: https://www.mediawiki.org/wiki/Extension:EventLogging#Using_mediawiki-vagrant
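(Editor's note: the "93 errors and no output" puzzle above is resolved by the INFO line from the checker — going by this conversation, the mediawiki-history checker only writes error output when the error ratio exceeds a configured maximum, and 93/5463 ≈ 1.7% stayed under it. A tiny arithmetic sketch; the 5% threshold below is a made-up stand-in for whatever value the job is actually configured with.)

    errors=93; total=5463; threshold=0.05   # threshold value is hypothetical
    awk -v e="$errors" -v t="$total" -v th="$threshold" \
        'BEGIN { r = e / t; printf "ratio = %.4f -> %s\n", r, (r < th ? "below threshold, no error output written" : "above threshold, errors written") }'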