[00:04:24] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935858 (Nuria) I see events being inserted: MariaDB [log]> select timestamp from MobileWikiAppShareAFact_12588711 order by timestamp desc limit 10;...
[00:13:25] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935878 (Nuria) InputDeviceDynamics_17661826 has a couple of records from today, but it does not look like there were many events sent from all-events.log
[00:19:17] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935908 (Tgr) ``` tgr@deployment-eventlog02:~$ date Thu Feb 1 00:18:32 UTC 2018 tgr@deployment-eventlog02:~$ ack-grep InputDeviceDynamics /srv/log...
[00:31:45] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935956 (Nuria) >So yes, it is working somewhat, but some of the events seem to get lost (delayed?). They are delayed yes (always, EL insertion in M...
[00:40:00] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935958 (Tgr) Seems like all the beta eventlogging tables use InnoDB.
[00:44:17] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3935967 (Tgr) ``` MariaDB [log]> show global variables like 'default_storage_engine'; +------------------------+--------+ | Variable_name |...
[01:37:16] Analytics, Analytics-Cluster: Move non-critical monthly jobs to the nice queue - https://phabricator.wikimedia.org/T186180#3936013 (Tbayer)
[01:42:46] Analytics, Analytics-Cluster: Move non-critical monthly jobs to the nice queue - https://phabricator.wikimedia.org/T186180#3936026 (Tbayer)
[10:41:10] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3936830 (faidon) I had a look at both `modules/eventlogging/files/eventloggingctl` and `modules/eventlogging/templates/upstart/*`. They all seemed fairly easy to reimplement with systemd (with or wit...
[12:17:23] !log Restart cassandra monthly bundle after January deploy
[12:17:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:06:59] !log Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives
[14:07:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:03:32] joal, ottomata : i have to attend a budget meeting today (yipie!) so i cannot make standup
[16:07:24] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure, User-Elukey: EventLogging broken in beta - https://phabricator.wikimedia.org/T185952#3937881 (Nuria) I am going to restart eventlogging from master (it was on a different changeset than what we have now in prod). Let's take a look ag...
[16:08:57] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3937886 (Ottomata) I actually tried to move to systemd a couple of years ago. I don’t remember the exact details, but there were some serious difficulties in automatically registering groups of proc...
[16:19:11] joal: can you explain a bit what happened today with the webrequest jobs you re-started ?
[16:22:44] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3937952 (Ottomata) Ah, found previous ticket: T114199
[16:57:23] Hi nuria_ - Was away grabbing Lino from school
[16:57:51] nuria: I did not restart webrequest jobs today - I confirmed their alarms for dataloss were false positives
[16:58:14] nuria_: I however restarted cassandra bundle, because one of their jobs (pageview-by-coun
[16:58:37] joal: missed your last sentence
[16:58:38] try) had been restarted with differences in the previous month
[17:00:07] nuria_: cassandra bundle handles jobs with different frequencies (hourly, daily, monthly)
[17:00:45] !log killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.
[17:00:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:00:49] nuria_: When one of the jobs of the bundle is updated (namely the pageview-by-country one - updated for country ISO instead of fullnames)
[17:02:08] nuria_: I kill the specific job in the bundle, not the whole thing, to prevent recomputing beginning of the month
[17:03:00] nuria: Then I restart the full bundle at the beginning of the next month, to reset the whole thing up
[17:08:26] hello from munich! :)
[17:08:45] Hi elukey :)
[17:12:22] I am surprisingly a working human being now, but I guess that tomorrow my start working time might get affected by jetlag :D
[17:22:50] elukey: Please take your time to recover :)
[17:27:57] elukey: ya, we'll see you tomorrow
[17:30:15] Analytics-EventLogging, Analytics-Kanban: Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version - https://phabricator.wikimedia.org/T179540#3938193 (Ottomata) > Not quite sure what "anyway" referred to exactly This referred to the fact that the timestamp field in EventLog...
[17:30:19] Analytics-EventLogging, Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3934659 (Ottomata) Investigating! The same JsonRefine job has been stuck since Jan 26: https://yarn.wikimedia.org/cluster/app/application_1515441536446_52892 El...
[17:34:29] Analytics, EventBus, Services: Use RevisionRecordInserted hook for EventBus revision-create records - https://phabricator.wikimedia.org/T186228#3938203 (Pchelolo) a:Pchelolo
[17:34:43] Analytics, EventBus, Services (doing): Use RevisionRecordInserted hook for EventBus revision-create records - https://phabricator.wikimedia.org/T186228#3937787 (Pchelolo)
[17:35:12] Analytics-EventLogging, Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3938207 (Nuria) Let's make sure after this investigation is complete we set up some alarms for jsonrefine similar to the ones we have on the mysql end so we can see...
[17:35:52] Analytics-EventLogging, Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3938223 (Ottomata) Indeed
[17:37:06] Hey nuria_ - Gone for dinner - We can talk after
[17:37:20] joal: yes, I am on annual plan meeting
[17:56:48] 420 pics taken, will try to do some adjustments to them and then I'll put them in drive :)
[18:27:16] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938382 (MoritzMuehlenhoff) >>! In T185667#3937886, @Ottomata wrote: > I actually tried to move to systemd a couple of years ago. But T114199 was for jessie, the first Debian release with systemd...
[18:30:43] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3938423 (egalvezwmf) @Nuria To some extent, this report interface should be how other teams are structuring and thinking about their survey projects. All survey projects have goa...
[18:34:26] (PS3) Ottomata: Add configurable transform function to JSONRefine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/405800 (https://phabricator.wikimedia.org/T185237)
[18:53:31] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938562 (RobH)
[18:53:45] Hey nuria_ - I'm back and will be working a couple hours
[19:01:22] joal: heya
[19:01:27] can you launch a spark yarn shell?
[19:01:36] spark2-shell --master yarn
[19:01:37] ?'
[19:02:00] hm, ottomata, didn't try - but I suppose it's hanging right?
[19:02:17] I think ebernhardson is making a big usage of processors lately :)
[19:02:26] yeah it's hanging
[19:02:36] but, there is room in the cluster
[19:03:15] hmm ok, i guess that's it then
[19:03:18] ottomata: 2 things altogether - Beginning of month big jobs, and ebernhardson ML jobs that don't take too many containers, but eat a lot of CPUs
[19:03:25] hmm
[19:03:35] oh you are right
[19:03:38] no vcores free
[19:03:41] huh
[19:03:49] ok, i can wait
[19:04:06] ebernhardson: Could we ask you to limit the number of parallel workers?
[19:04:58] ottomata: this actually makes the cluster later than ever, even for a beginning of month - I prefer to ask for limitation :)
[19:05:43] aye
[19:05:54] joal: btw, uploaded new patch for https://gerrit.wikimedia.org/r/#/c/405800/
[19:05:59] def better with a list of funcs
[19:06:05] ottomata: Will have a look P)
[19:15:10] this is the first time i've actually not been able to even use the cluster
[19:15:11] wow
[19:16:10] ottomata: beginning of month is a usual busy time - but I must say that since Erik started his experiments with xgboost ML, cores are beginning to be a scarce resource :)
[19:20:16] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938690 (RobH) Install blocked by network issue detailed on T186252 for onsite work.
[19:23:02] (CR) Joal: [C: 1] "Looks gooooood :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/405800 (https://phabricator.wikimedia.org/T185237) (owner: Ottomata)
[19:23:17] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3938715 (Nuria) I think it is probably worth doing some research into generic solutions for survey data visualizations that can be used more broadly, the wikistats design around dat...
[19:23:56] joal: let's talk for 30 mins?
[19:24:03] sure nuria_
[19:24:12] omw to batcave
[19:24:13] nuria_: OMW to the cave
[19:31:15] joal: i can limit it a bit more, sure
[19:31:23] ebernhardson: Heya :)
[19:31:48] ebernhardson: RAM is really not an issue in the jobs you launch - CPUs however are missing for others :)
[19:32:38] joal: indeed, the training jobs are just cpu heavy.
even the largest wiki is only maybe 1.25GB/core
[19:34:12] it trains in separate runs though, so it should only run ~2-4 hours, then shutdown and startup another spark job for other wikis, i was hoping that the new jobs would wait around enough in the accepted queue for everything else to move along but i guess not..
[19:34:48] ebernhardson: another way to be nice is to use the nice queue :)
[19:35:37] hmm, i should be able to adjust this all easy enough to switch queues. lemme see
[19:36:15] ebernhardson: I can switch queue as well - I just don't think it'll help now that the workers are instantiated
[19:36:33] ebernhardson: nice will help only at job start
[19:37:13] sure, but i mean we have some config that sits in front of the training that figures out how to call spark, i can adjust that so in the future it puts things in the nice queue
[19:37:37] ebernhardson: Didn't get that, sorry - That's great :)
[19:39:40] joal: where do you think I should put transform functions?
[19:39:48] e.g. eventloggingDeduplicate
[19:40:09] joal: one other random thing you might know ... for some reason my training task doesn't like to give executors back to yarn even when it doesn't use them for 10 minutes. The training process itself doesn't cache any jars, it reads from disk for each training stage
[19:40:13] joal: any idea where to look for why?
[19:40:28] err, i mean it doesn't cache any rdd's
[19:41:22] Analytics, Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938830 (RobH)
[19:42:44] ebernhardson: hm - weird
[19:43:41] ebernhardson: I'm assuming you have dynamic allocation setup
[19:44:01] joal: right, dynamic allocation enabled, max executors set,
[19:44:52] ebernhardson: maybe the number of cores per executor makes any executor always busy, even if not using all its cores?
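
The "config that sits in front of the training that figures out how to call spark" mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the helper name `build_spark_submit`, the application name `train.py` and the executor cap of 100 are hypothetical and do not come from the channel. The `--queue` option and the `spark.dynamicAllocation.*` keys are standard Spark-on-YARN settings, and `nice` is the queue name used in the discussion.

```python
from typing import Dict, List


def build_spark_submit(app: str, queue: str, max_executors: int,
                       idle_timeout_s: int = 60) -> List[str]:
    """Assemble a spark-submit argv for a YARN job (hypothetical helper)."""
    conf: Dict[str, str] = {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Idle executors are released after this timeout. Executors holding
        # cached data follow spark.dynamicAllocation.cachedExecutorIdleTimeout
        # instead, which never expires by default.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }
    # The queue only matters at submission time, so it is part of the command.
    argv = ["spark-submit", "--master", "yarn", "--queue", queue]
    for key, value in sorted(conf.items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    return argv


# Hypothetical values, for illustration only.
cmd = build_spark_submit("train.py", queue="nice", max_executors=100)
print(" ".join(cmd))
```

Since the queue is chosen when the job is submitted, a wrapper like this would only affect future runs, which matches the remark that "nice will help only at job start".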
[19:45:22] joal: shouldn't, i'm assigning all the cores to a single task (which then gets used in C++ via JNI)
[19:46:34] rhmmm
[19:46:40] joal: basically what happens is i run 750 jobs that each need 1 executor and run 5-20min (depending on wiki size), then when those finish i run a task on a single executor that takes similar 5-20minutes. That process is done for each wiki. So i expect it to find executors when i issue all the jobs, then give them back when doing the single-executor thing
[19:46:45] but it's not :S
[19:49:42] Right ebernhardson - This is weird indeed !
[19:49:46] oh well i'll poke at it a bit more, and maybe moving to the nice queue for future runs will make it less important
[19:50:25] ebernhardson: nice queue and eating 1/3 instead of 1/2 of CPUs would help :)
[19:50:47] can you just buy more cpu's to make the current 1/2 be 1/3 ;P
[19:50:52] ebernhardson: I've not used spark with JNI, so I have no clue how it messes around
[19:51:06] nah it'll be fine, also we might be switching training algorithms to something twice as efficient cpu wise soon
[19:51:22] ebernhardson: We'll buy more CPUs, and you'll bump your third ;)
[19:51:28] :)
[19:51:50] ebernhardson: would you mind telling me what's the thing?
[19:52:12] I mean, twice as efficient, this is quite an improvement !
[19:52:45] joal: it's a different library, we are using xgboost right now which has been out longer, but microsoft released lightgbm which is more efficient. Both are for learning gradient boosted tree ensembles
[19:52:49] nuria_: we forgot to talk about cassandra restart - have you understood why I restarted it?
[19:53:54] got it ebernhardson - Thanks :)
[19:56:04] ebernhardson: Just saw that sparkml has a GBT version - You obviously tested it and it was slower than C++ with JNI, right?
[19:56:39] joal: right, also we don't really need distributed training. The coordination of distributed training makes things slower.
What we actually need is training the same thing with different parameters many many times in parallel
[19:57:14] ebernhardson: hyperparam tuning with as fast as possible training - makes sense
[19:57:28] ok :)
[19:57:39] hmm, since we are talking spark...either of yall know how to pass a struct column to DataFrame dropDuplicates()?
[19:57:40] e.g.
[19:57:41] Thanks for explanations ebernhardson :)
[19:57:56] df.dropDuplicates(Seq("meta.request_id"))
[19:59:59] ottomata: hmm, if it doesn't work directly like that i dunno :(
[20:00:19] yeah, hmm
[20:00:20] ok
[20:04:51] joal: nooo,right
[20:04:53] ottomata: in quick limited testing ... all i can manage is df.withColumn('temp', F.col('meta.request_id')).dropDuplicates(Seq('temp')).drop('temp') which is not very satisfying
[20:04:58] joal: why did we restart it?
[20:05:30] nuria_: to have jobs homogeneously started through bundle
[20:06:03] nuria_: bundles need to be started at beginning of month to prevent beginning-of-month reruns
[20:06:17] ottomata: never tried - Will check in a bot
[20:06:23] sbot/bit
[20:06:32] joal: are those jobs running from cluster deployed code?
[20:06:51] joal: me no compredou
[20:07:08] nuria_: batcave for a minute?
[20:07:12] sure
[20:18:44] Analytics, ChangeProp, EventBus, Services (doing): Support reliable delayed job execution in ChangeProp - https://phabricator.wikimedia.org/T186261#3939040 (Pchelolo)
[20:45:46] ottomata: about your question -- https://stackoverflow.com/questions/44505772/spark-dropduplicates-based-on-json-array-field
[20:46:00] Looks like ebernhar|lunch solution is the one
[20:46:49] ohh hmmm
[20:46:50] ok cool
[20:47:23] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3939133 (egalvezwmf) Thanks @Nuria. I think adding a research component to the project next year would be a great idea. I think having a roadmap for this work and what it might l...
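
The workaround discussed above (copy the nested field into a flat temporary key, deduplicate on it, then discard the key) can be illustrated without a cluster. The following is a plain-Python sketch of that idea, not Spark code: the helper names `get_path` and `drop_duplicates_by_path` and the sample events are made up for illustration. It keeps the first record seen for each key, which is a choice made here and not something the channel specifies.

```python
from typing import Any, Dict, Iterable, List


def get_path(record: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as 'meta.request_id' into nested dicts."""
    value: Any = record
    for part in dotted.split("."):
        value = value[part]
    return value


def drop_duplicates_by_path(records: Iterable[Dict[str, Any]],
                            dotted: str) -> List[Dict[str, Any]]:
    """Keep the first record seen for each value of the nested field."""
    seen = set()
    kept = []
    for record in records:
        # This extracted value plays the role of the 'temp' column.
        key = get_path(record, dotted)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Made-up sample events; the first and third share a request_id.
events = [
    {"meta": {"request_id": "a"}, "n": 1},
    {"meta": {"request_id": "b"}, "n": 2},
    {"meta": {"request_id": "a"}, "n": 3},
]
deduped = drop_duplicates_by_path(events, "meta.request_id")
print([e["n"] for e in deduped])
```

The records themselves are returned unchanged, so nothing has to be dropped afterwards; in the DataFrame version the temporary column is real and needs the final `.drop('temp')`.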
[20:49:03] (PS1) Ottomata: [WIP] Add TransformFunctions for JsonRefine job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237)
[20:49:10] joal: check it out ^
[20:49:13] not sure of best place to put those
[20:50:17] ottomata: not sure where to put that indeed :(
[20:50:28] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3939139 (Nuria) I think the goals and audiences are perhaps less clear. As I said on our end we would need to think a solution that fits many survey use cases not one, since surv...
[20:50:38] ottomata: not sure where to put that indeed :(
[20:51:26] ottomata: except from that, given we want to use function list, I'd rather not do a dedup inside refine for EL for instance :)
[20:51:48] joal: i'm fine with that either way
[20:51:49] ottomata: I'm gonna go to bed - I'll think about a place for those tomorrow :)
[20:52:00] list might get a little long, but either way
[21:07:37] Analytics-EventLogging, Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3939177 (Ottomata) Ok, still not sure why that one job was stuck, but after killing it, the next scheduled run seemed to have worked. @Tbayer, let us know if it is l...
[21:45:05] Analytics-Kanban, Operations, ops-eqiad: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3939349 (Dzahn) p:Triage>Normal
[21:47:23] chelsyx: getting back to the app session work today hopefully
[21:48:29] nuria_: Thank you very much!
[23:34:47] Analytics-Kanban, Operations, User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3931188 (Dzahn) Yea, that makes sense. I also think it's the easiest way to create a new disk in ganeti and then mount it.
[23:35:32] Analytics-Kanban, Operations, User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939629 (Dzahn) p:Triage>High
[23:52:42] Analytics, Operations, ops-eqiad, Patch-For-Review: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939682 (RobH)