[05:21:33] morning!
[05:21:50] groceryheist: o/ - have you been able to use pyarrow?
[05:23:44] joal: bonjour! Did you guys have problems with timers yesterday?
[05:32:55] I added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers#Shut_off_/_Reset_Icinga_alert_already_fired
[05:33:05] this is to reset the alert that fired
[05:33:14] (I just reset-failed the sqoop unit)
[05:35:07] RECOVERY - Check the last execution of refinery-sqoop-mediawiki-production on an-coord1001 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki-production
[05:39:10] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10elukey) Via pyspark2 seems working: ` elukey@stat1004:~$ pyspark2 --master yarn Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8 Python 3.5.3 (default, Sep 2...
[05:55:32] 10Analytics: Upgrade pandas in spark SWAP notebooks - https://phabricator.wikimedia.org/T222301 (10elukey) The 0.19.2 version is the one shipped with Debian Stretch: https://packages.debian.org/search?keywords=python3-pandas When we'll upgrade to Debian Buster (not scheduled yet) we'll jump to 0.23, but upgradi...
[06:07:50] user 'analytics' created! \o/
[06:12:19] 10Analytics: Check home leftovers of pirroh - https://phabricator.wikimedia.org/T222140 (10elukey) 05Open→03Resolved a:03elukey Thanks! Everything cleaned up :)
[07:14:06] Morning elukey
[07:15:16] elukey: I had to play with timers a bit yesterday, and my ignorance was put into lig
[07:15:21] light, brightly :)
[07:24:02] what was the main doubt about it? I'll update the docs accordingly, there is probably some gap in there :(
[07:24:33] elukey: I didn't understand that timers were restarted automatically, for instance
[07:24:52] elukey: I manually killed the sqoop job, but did not disable the timer
[07:25:10] * joal hides for not having carefully read elukey's writing
[07:32:39] nono I think that multiple people didn't get those, so the docs need a bit of reshaping
[07:32:56] the main issue is that you guys should be able to disable puppet
[07:33:17] but not sure if it is ideal as a thing to ask the sre team permissions for
[07:33:35] elukey: wow - I don't want the powa
[07:37:30] what do you think is best? Keep going with the current workflow?
[07:37:48] I saw that Nuria mentioned in the chan that you guys should be able to operate on systemd timers
[07:38:45] elukey: I'm kinda ok with timers-operations, but if we could put puppet out of the equation that'd be great - You know me - cheese AND dessert :)
[07:39:29] joal: what do you mean by removing puppet?
[07:39:47] the main issue is that we define the timers in it
[07:40:20] elukey: so, if we disable timers manually, puppet needs to be disabled so that it doesn't re-enable them - correct?
[07:40:22] I am wondering if there is an easy solution but I don't have any atm :(
[07:40:28] yeah exactly
[07:40:43] hm
[07:40:44] because puppet tries to enforce the presence of what is defined
[07:41:12] elukey: I think I don't understand what happened yesterday then ... Did Andrew disable puppet?
[07:42:32] joal: I think not, I don't see any mentions of it in the chan's log
[07:42:57] elukey: That's what I thought - So him disabling the job actually did nothing - right?
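For reference, the puppet/timer interplay being discussed amounts to something like the following - a minimal sketch, assuming sudo rights on the host and using the generic `puppet agent --disable` (WMF hosts also carry disable-puppet/enable-puppet wrapper scripts, but the exact helper is an assumption here):

    # keep puppet from re-enabling the timer while it is manually disabled
    sudo puppet agent --disable "sqoop timer manually disabled - <your name>"
    # stop the timer now and keep it from starting again
    sudo systemctl disable --now refinery-sqoop-mediawiki-production.timer
    # ...do the manual work, then let puppet restore the timer definition
    sudo puppet agent --enable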
[07:47:00] joal: it disabled the timer until puppet ran
[07:47:07] I added a warning to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers#Disable_a_Systemd_timer
[07:47:12] should be more visible
[07:47:47] also, when a unit fails, to clear the alarm it is sufficient to do
[07:47:54] systemctl reset-failed name-of-the-unit
[07:48:00] (that's what I did this morning)
[07:48:08] (listed now in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers#Shut_off_/_Reset_Icinga_alert_already_fired)
[07:48:41] elukey: May I be bold and update the page a bit to facilitate my understanding?
[07:49:10] yep yep this is what I am trying to do
[07:49:17] please go ahead and fix the docs
[07:49:23] it is clear that they are not good atm
[07:49:53] elukey: I won't fix, merely change the structure a bit and add a warning - I'll obviously submit for your review once done
[07:57:21] elukey: changes made - can you please review quickly?
[07:59:02] looks good!
[07:59:17] Great :)
[07:59:18] Th
[07:59:23] thanks
[07:59:43] Man, today I have an easy 'enter' key - need to be careful
[08:05:25] joal: speaking of fun things, the analytics user has been created
[08:05:36] I've seen that :)
[08:05:42] sooo we could start with the prep work to migrate a job (when you have time)
[08:05:55] I'm in between being happy and afraid :)
[08:09:22] elukey: we can do that now if you want
[08:09:34] elukey: I'm monitoring sqoop and others, but it leaves me time
[08:13:12] ah nice!
[08:13:34] so what was the job that you had in mind? (don't recall)
[08:17:55] elukey: oozie/aqs
[08:25:24] elukey: how shall we proceed?
[08:27:12] (03PS1) 10Elukey: Move aqs hourly to the new 'analytics' user [analytics/refinery] - 10https://gerrit.wikimedia.org/r/507743 (https://phabricator.wikimedia.org/T220971)
[08:27:52] joal: IIUC we should deploy, stop oozie, chown the files in /wmf/data/aqs, start oozie
[08:28:12] before that, I am wondering if anything needs to be done
[08:28:58] elukey: The plan seems good, except maybe for the deploy: for testing, I'd go for manual modification of files (we could even use a -D when restarting oozie)
[08:30:03] sure, it was to keep things in sync with refinery, but I am ok
[08:30:44] elukey: if the test succeeds, we'll move everything, so let's do the testing manually?
[08:31:01] +1
[08:32:43] I am running sudo -u hdfs hdfs dfs -ls -R /wmf/data/wmf/aqs | awk '$3 == "hdfs" {print $8}' | sudo -u hdfs xargs hdfs dfs -ls to see how many files we need to chown
[08:33:18] elukey: probably plenty :(
[08:35:23] it is also incredibly slow :D
[08:35:29] but it seems to be doing what we need
[08:35:51] the xargs command should be
[08:35:58] elukey: calling hdfs for each file is not good - launching a jvm every time
[08:36:20] xargs hdfs dfs -chown analytics
[08:36:26] elukey: good side of things: you won't saturate the new RPC queue :-P
[08:36:29] yeah I know but I didn't find a better way
[08:37:07] I sense some sarcasm in the air :D
[08:37:47] Absolutely not, more a joke ;)
[08:38:51] \o/ Sqoop succeeded
[08:39:23] elukey: about sqoop-production, shall we re-enable the timer?
[08:39:58] elukey: hm - timer was not disabled - right
[08:40:22] elukey: just reset-failed
[08:40:44] elukey: what is the right approach here? Manually launch, since reset-failed has been applied?
[08:41:08] Or doing something else to timers to let the automatic thing catch up?
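The survey-then-chown pipeline from 08:32-08:36 above, assembled end to end as a sketch (same paths and sudo rules as in the log): awk prints column 8 of the `hdfs dfs -ls` output - the path - for entries whose owner (column 3) is hdfs, and xargs batches many paths into each hdfs invocation so a JVM is launched per batch rather than per file:

    # chown to 'analytics' everything under the aqs tree still owned by hdfs
    sudo -u hdfs hdfs dfs -ls -R /wmf/data/wmf/aqs \
      | awk '$3 == "hdfs" {print $8}' \
      | sudo -u hdfs xargs hdfs dfs -chown analytics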
[08:41:40] I'm assuming that a manual run of the script is the easiest, but I prefer to ask :)
[08:42:34] so reset-failed only instructs systemd to remove the "failed" state from the last execution of the unit
[08:43:01] ok
[08:43:07] to check the status of the timer, we can do systemctl list-timers | grep sqoop
[08:43:26] Sat 2019-06-01 18:00:00 UTC 4 weeks 2 days left Wed 2019-05-01 18:00:02 UTC 14h ago refinery-sqoop-mediawiki-production.timer refinery-sqoop-mediawiki-production.service
[08:43:40] so the next execution will be in 4 weeks
[08:43:44] right
[08:43:50] so a manual run is needed
[08:44:01] OKKKKK - Sorry, I'm slow to get it
[08:44:02] ah I thought that you did it already yesterday
[08:44:05] or was it a test?
[08:44:14] elukey: 2 sqoop jobs :)
[08:44:24] elukey: yesterday I did the 1st one
[08:44:36] I am not following sorry :)
[08:44:38] now the second one is needed
[08:45:01] elukey: yesterday I ran 'refinery-sqoop-mediawiki' manually
[08:45:17] now that it finished, we need to run refinery-sqoop-mediawiki-production
[08:45:39] ahhh okok
[08:46:02] so if we want to kick off -production, it is sufficient to
[08:46:09] systemctl restart (or start) refinery-sqoop-mediawiki-production.service
[08:46:15] note that we need the .service
[08:46:17] not the timer
[08:46:32] (because the timer only execs the .service)
[08:46:43] do you want me to start it?
[08:51:23] joal: --^
[08:53:42] sorry elukey, was away for a minute
[08:53:46] elukey: I can do it :)
[08:54:21] elukey: actually, I can't :)
[08:55:05] yeah :D
[08:55:08] elukey: either you do it, or I do as I did for the 1st job: sudo -u hdfs /usr/local/bin/refinery-sqoop-mediawiki-production
[08:55:15] :)
[08:55:17] You tell me
[08:55:28] wait a sec, what error did you get?
[08:55:44] Failed to start refinery-sqoop-mediawiki-production.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
[08:56:11] elukey: same error when trying without sudo, or sudoing as hdfs
[08:56:19] sudo to root asks for passwd
[08:56:50] yep yep this is consistent with the sudoers rules, since there is no systemctl * etc...
[08:57:18] !log manual start of refinery-sqoop-mediawiki-production.service
[08:57:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:58:32] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10JAllemandou) @elukey : in pyspark `import` only uses code on the driver, meaning the local machine in case of shell (`stat1004` in the example you gave). Using...
[08:58:41] Thanks for the restart elukey -- Tailing logs !
[09:00:37] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10elukey) >>! In T222254#5152316, @JAllemandou wrote: > @elukey : in pyspark `import` only uses code on the driver, meaning the local machine in case of shell (`s...
[09:44:49] elukey: please tell me you've added pyarrow to the cluster lately
[09:46:00] Ahhh - no ok
[09:46:06] I got a working example :)
[09:46:10] sorry for the ping elukey
[09:50:13] we don't use pyarrow right?
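elukey's systemd walkthrough above, condensed to commands (a minimal sketch using the unit names from the log; sudo rights on the host are assumed, which is exactly what the PolicyKit error shows joal lacked):

    # when is the timer's next scheduled execution?
    systemctl list-timers | grep sqoop
    # kick off the job now: start the .service, not the .timer
    # (the timer only schedules executions of the service)
    sudo systemctl start refinery-sqoop-mediawiki-production.service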
[09:50:34] IIRC Andrew added it to the spark2 package after a req from Diego
[09:52:33] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10JAllemandou) Below is a code example that works in local-mode on stat1004 (`pyspark2 --master local[2]`) and fails in yarn mode (`pyspark2 --master yarn`): ` im...
[09:52:35] elukey: updated the task --^
[09:52:47] elukey: need to go run an errand for ~1h
[09:52:55] later :)
[10:25:37] 10Analytics, 10Patch-For-Review, 10User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (10elukey) Up to now I have added: * proper HDFS auditing logs * more capacity in handling the RPC call queue * an alarm on RP...
[10:25:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (10elukey)
[10:36:44] * elukey lunch!
[11:24:55] 10Analytics, 10WMDE-Analytics-Engineering: Install Gensim for Python3 on stat1007 - https://phabricator.wikimedia.org/T222359 (10GoranSMilovanovic)
[11:58:37] (03CR) 10Joal: [C: 03+1] "Good for me! Do you want me to merge and deploy, or shall we go with manual testing?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/507743 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey)
[12:52:35] joal: if you have time we can test the aqs job!
[12:52:44] yessir
[12:53:12] just as a sanity check, let's create a simple mapreduce job launched by analytics
[12:54:02] ok
[12:54:31] elukey: I don't have rights to sudo as analytics on stat1004
[12:54:49] ah very nice
[13:00:28] I am seeing wmf tables via sqlContext.sql("show tables").show(100,False) in pyspark yarn
[13:00:36] (with sudo -u analytics etc..)
[13:01:02] trying a simple select
[13:04:01] yep seems to be working :)
[13:04:23] https://yarn.wikimedia.org/cluster/app/application_1555511316215_57945
[13:05:45] joal: I think that we could kill the oozie job, chown and then start manually (with the -D)
[13:09:48] ok elukey
[13:10:17] !log Kill oozie aqs-hourly-coord
[13:10:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:41] elukey: I need to let you chown, as I don't have rights
[13:12:07] joal: the command is
[13:12:08] sudo -u hdfs hdfs dfs -ls -R /wmf/data/wmf/aqs | awk '$3 == "hdfs" {print $8}' | sudo -u hdfs xargs hdfs dfs -chown analytics
[13:12:22] does it make sense?
[13:13:04] it does - Can't I do 'sudo -u hdfs chown -R analytics hdfs /wmf/data/wmf/aqs/hourly ?
[13:13:36] IIRC the -R parameter doesn't exist, but lemme triple check
[13:14:02] ah no!
[13:14:21] nevermind then, this is better
[13:14:45] Ok, going for that
[13:14:46] I think originally I tried to come up with a command that was only targeting files owned by hdfs
[13:14:50] wait wait
[13:14:53] sure
[13:15:01] are we 100% sure that all those files are owned by hdfs?
[13:15:08] (doing the pedantic/pessimisti)
[13:15:10] *pessimistic
[13:15:49] I can't imagine otherwise - aqs/hourly is generated by oozie
[13:16:26] ack then
[13:16:31] are you going to execute it?
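joal's point in the task comment above - that a bare `import` only runs on the driver, i.e. the local stat box - can be checked from a pyspark2 shell. A minimal sketch (the tiny two-partition RDD is purely illustrative) that forces the import to happen inside the YARN executors as well:

    pyspark2 --master yarn
    >>> import pyarrow   # runs on the driver (the stat box) only
    >>> sc.parallelize(range(2), 2) \
    ...   .map(lambda i: __import__("pyarrow").__version__) \
    ...   .collect()     # runs the import inside the YARN containers

If pyarrow is only installed locally, the first line succeeds and the second raises the ImportError from the executors, matching the behaviour described in the task.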
[13:16:34] otherwise I can do it
[13:16:59] or we can pair on bc to be sure
[13:17:05] let me know what you prefer :)
[13:17:12] elukey: hdfs dfs -ls /wmf/data/wmf/aqs/hourly/*/*/* | grep -e -v '^hdfs'
[13:17:16] gives no result :)
[13:17:21] Doing it
[13:17:54] thanks :)
[13:20:30] if this works I am not sure if I should be happy or not, since the amount of jobs to kill/restart/etc.. will be massive :D
[13:20:48] elukey: told you so
[13:21:04] 10:05:54 < joal> I'm in between being happy and afraid :)
[13:21:26] security first! :P
[13:21:34] The afraid part of this is getting bigger at every successful step toward this happening
[13:21:42] ah joal I added an alarm for the RPC queue length of the namenode this morning
[13:21:56] 10 -> warning 20 -> critical
[13:21:59] let's see how it goes
[13:22:02] * joal wonders how he could live without elukey around <3
[13:22:09] <3 <3 <3
[13:23:34] chown still going
[13:23:54] ah yes it will take a bit
[13:25:23] 10Analytics, 10Growth-Team, 10Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10Milimetric) @Ottomata / @Pchelolo: short description is `event.mediawiki_revision_create` is missing about 1.5% of the revisions that are in the revision table. My...
[13:29:15] 10Analytics, 10Analytics-EventLogging, 10MW-1.34-release, 10Technical-Debt (Deprecation): Remove deprecated EventLogging schema modules - https://phabricator.wikimedia.org/T221281 (10Milimetric) p:05Triage→03High
[13:29:33] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.34-release, 10Technical-Debt (Deprecation): Remove deprecated EventLogging schema modules - https://phabricator.wikimedia.org/T221281 (10Milimetric) a:03Milimetric
[13:29:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 (10Milimetric) a:03elukey
[13:29:56] elukey: finished
[13:30:17] drwxr-xr-x - analytics hadoop 0 2015-12-18 19:03 /wmf/data/wmf/aqs/hourly/year=2015
[13:30:22] and others
[13:30:39] !log Restarting AQS oozie job with -Duser=analytics parameter
[13:30:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:32:14] joal: I'm looking into the dataloss warning to help you
[13:32:20] Thanks milimetric :)
[13:32:21] (and to make up for missing my ops week last week)
[13:32:31] joal: woooooow
[13:32:43] milimetric: o/
[13:32:48] how are you feeling today?
[13:33:13] elukey: config for job is looking good (says user = analytics)
[13:33:19] I'm getting better, thanks, definitely feel ok enough to work
[13:33:36] :)
[13:33:38] elukey: still missing data to start
[13:34:44] milimetric: denormalize for 2019-04 has started, I'm currently checking the 2019-03 recomputed snapshot (looking good so far)
[13:34:56] homework for me is requesting the possibility to sudo as analytics for you guys
[13:36:55] elukey: I think the day to day thing that we need is just permission to interact with systemd timers like the sqoop ones without bugging you or andrew, the permissions changing and other stuff that needs sudo -u analytics is not something we should need daily
[13:37:26] joal: very cool, here if you need me
[13:38:22] milimetric: yes I agree but in theory I'd need to ask the SRE team for you guys to be able to systemctl restart *
[13:38:39] k
[13:38:43] because the timer units have different names etc..
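A more defensive way to verify the result of the chown than the grep above (a sketch, reusing the awk pattern from earlier in the log: with `hdfs dfs -ls` the owner is column 3, so matching on it directly does not depend on where the owner falls in the line):

    # show anything under the tree NOT yet owned by analytics; empty output means done
    sudo -u hdfs hdfs dfs -ls -R /wmf/data/wmf/aqs/hourly | awk '$3 != "analytics"'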
[13:38:49] and this needs to be approved by the SRE team
[13:39:23] the other thing is that we are slowly getting a lot of power over the whole cluster (we as the analytics team)
[13:39:29] and I am starting to be worried a bit
[13:39:54] not because I don't trust the team of course, but because the prevention from doing damage is fading away
[13:40:16] (due to say a wrong copy/paste of a command, execution of a procedure, etc..)
[13:40:49] elukey: we used to be able to do it, they can sudo to hdfs
[13:40:59] ottomata: morning :)
[13:41:01] why can't they sudo -u analytics systemctl timer stuff
[13:41:04] hello!
[13:41:23] yeah, I kind of thought it was getting harder to do dangerous stuff
[13:41:25] * milimetric nervous
[13:41:29] nono what I meant is that I am planning to just ask for systemctl restart *, since we'll need capabilities to restart timers
[13:41:37] etc..
[13:41:48] hi a-team, can you please remind me of the rsync syntax to copy between stat/notebook machines... I remember that there is a double / but not exactly where
[13:43:02] https://wikitech.wikimedia.org/wiki/Analytics/FAQ :)
[13:45:01] anyway, the worst damage that we can do (as analytics) is already possible with hdfs-user capabilities on data
[13:45:32] and everybody is already capable of restarting most of the hadoop daemons
[13:45:40] so it shouldn't be much different from now :)
[13:45:43] ok opening a task
[13:46:04] (now I have some pros/cons to bring up :P)
[13:54:03] milimetric: about the dataloss alert - forgot to say - we need to wait for the next hour to be processed !!
[13:54:51] yeah, I just realized newly_computed_rows_loss is negative when I left-join instead of inner-join, so I had come to the same conclusion the hard way
[14:06:16] ottomata, thx
[14:14:50] joal: first hour for aqs completed! \o/
[14:17:41] 10Analytics, 10Continuous-Integration-Config, 10Release-Engineering-Team (Watching / External): Status of analytics/limn-*-data git repositories? - https://phabricator.wikimedia.org/T221064 (10Milimetric) We decided to merge the repositories into our main reportupdater-queries repository. We will use that g...
[14:18:11] 10Analytics, 10Core Platform Team, 10MediaWiki-API, 10Patch-For-Review, 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10EvanProdromou) a:05Tgr→03EvanProdromou I'm going to take over this ticket until I figure out what's...
[14:19:24] 10Analytics, 10MediaWiki-API, 10Core Platform Team (Modern Event Platform (TEC2)), 10Patch-For-Review, 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10EvanProdromou)
[14:19:41] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Hm, sorry for this probably too late idea...but would it be worth building a C based prometheus...
[14:20:01] joal: mmm the submitter seems to be hdfs though, also the mapreduce.job.user.name.. shall we start it again as "analytics"?
[14:27:32] 10Analytics, 10MediaWiki-API, 10Core Platform Team (Modern Event Platform (TEC2)), 10Patch-For-Review, 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Milimetric) @EvanProdromou I'm not sure about the details, but you'll wan...
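On the rsync question above: the cross-host copy uses rsync's daemon (double-colon module) form. A hypothetical sketch - the host, module name, and path are all illustrative assumptions; the real module names are on the FAQ page linked in the reply:

    # general shape: remote_host::module/path (the '::' selects an rsyncd module)
    rsync -av stat1007.eqiad.wmnet::srv/home/me/myfile.tsv ./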
[14:31:00] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10elukey) @Ottomata ahhhh you mean in varnishkafka itself! I thought that it would have needed a change in...
[14:31:43] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Milimetric) a:03Milimetric
[14:37:53] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) I think varnishkafka is already using this callback to write the stats out to the json file. I...
[14:40:40] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10elukey) But something would need to be created (a simple exporter) to read the json with the Prometheus m...
[14:43:56] nuria: we're receiving some snapshots of mforns's talk at Strata. It's great that he's there and is giving a talk. :)
[14:44:16] nuria: Miriam is there and reporting live. ;)
[14:46:34] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Aye yeah I guess there'd have to be some pull service, ya. Maybe converting whatever varnishka...
[14:46:55] leila: ya, saw that, this is our 3rd talk at Strata, i think they like us
[14:47:10] leila: was great to get those in wassupp, ayayayay
[14:51:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10ema) >>! In T196066#5153029, @Ottomata wrote: > Hm, sorry for this probably too late idea...but would it...
[15:02:49] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Ya good point
[15:03:55] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Hm but also, whatever we replace varnishkafka with will likely be librdkafka based. Perhaps a l...
[15:19:53] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis, and 3 others: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 (10Ottomata) There are two scripts, one for hourly and daily: https://github.com/wikimedia/wikimedia...
[15:20:00] ottomata, elukey : i think we are going to need our budget GPUs for serving, not only for experimentation, will have more info tomorrow
[15:22:19] ok! :)
[15:24:59] mmmmm
[15:26:12] nuria: do you mean not only for our experiments, but also for other use cases?
[15:26:35] I am not even sure if we'll have time to complete the hadoop experiments :D
[15:34:20] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) a:05Ottomata→03None
[15:34:25] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis, and 3 others: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 (10Ottomata) a:05Ottomata→03None
[15:35:52] so I am tempted to kill/restart the aqs job and start it with sudo -u analytics etc..
[15:36:07] just to be super sure that everything works fine
[15:41:31] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) Hm, comparing throughput in Kafka: mediawiki_ApiAction: https://grafana.wikimedia.org/d/000000524/kafka-by-topic-graphite?refresh=5m&orgId=1&var-cl...
[15:46:36] hdfs dfs -ls /wmf/data/wmf/aqs/hourly/year=2019/month=5/day=2
[15:46:42] drwxr-xr-x - hdfs hadoop 0 2019-05-02 14:06 /wmf/data/wmf/aqs/hourly/year=2019/month=5/day=2/hour=12
[15:46:45] drwxr-xr-x - hdfs hadoop 0 2019-05-02 15:13 /wmf/data/wmf/aqs/hourly/year=2019/month=5/day=2/hour=13
[15:46:56] the others are analytics:hadoop joal --^
[15:47:09] I also discovered that 'analytics' is not on an-coord1001
[15:50:52] ottomata: I have one question about
[15:51:11] 10Analytics, 10Analytics-EventLogging, 10Reading-Infrastructure-Team-Backlog, 10Epic, 10Readers-Web-Backlog (Tracking): Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380 (10Jdlrobson)
[15:51:13] profile::hadoop::users vs profile::analytics::cluster::users
[15:51:16] do we need both?
[15:51:44] joal: yt?
[15:51:55] need a camus brain bounce
[15:53:36] Heya - back
[15:53:50] joal bc?
[15:53:53] sure ottomata
[15:54:40] 10Analytics: Upgrade pandas in spark SWAP notebooks - https://phabricator.wikimedia.org/T222301 (10Groceryheist) I see, for Python packages I usually use pip instead of Debian since python tends to move much faster than Debian. Of course, I'm just managing this for myself and not supporting a whole organization...
[15:56:39] elukey: mwar aqs :(
[15:57:45] joal: if you want I can start it again! So I'll learn something :)
[15:58:06] elukey: starting the job as analytics should do I guess
[15:58:21] I have the command if you want
[15:59:33] should be the one in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie/Administration#How_to_restart_Oozie_production_jobs right
[15:59:41] (with different .properties and -D user override)
[16:00:53] should be that indeed :)
[16:02:06] elukey: shall I kill the currently running job
[16:02:17] going to do it after standup :)
[16:02:40] elukey: killing it, so that we don't process any more with hdfs
[16:02:59] ah sure but we already have to chown some files
[16:03:20] yeah - we could actually delete them, in order to have fast results
[16:04:28] milimetric: webrequest-upload hour 13 has been computed :-P
[16:10:06] 10Analytics, 10Discovery, 10Operations, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Nuria) Ping @CDanis, any updates on this?
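The restart being discussed follows the wikitech procedure linked above; as a rough sketch of its shape only - the properties path, start_time value, and $OOZIE_URL here are illustrative assumptions, and the exact invocation lives on the wiki page:

    sudo -u analytics oozie job -oozie $OOZIE_URL \
      -Duser=analytics \
      -Dstart_time=2019-05-02T16:00Z \
      -config /srv/deployment/analytics/refinery/oozie/aqs/hourly/coordinator.properties \
      -run

The `-D` overrides are what make it possible to test the user change without deploying the refinery, as agreed earlier in the day.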
[16:10:43] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) So in Camus logs, I see errors like ` May 1 09:15:07 an-coord1001 camus-mediawiki-analytics[69524]: 19/05/01 09:15:07 ERROR kafka.CamusJob: Offset...
[16:16:14] 10Quarry: Unable to log in to Quarry - https://phabricator.wikimedia.org/T222375 (10Miraburubot)
[16:26:16] 10Analytics: Move the three sqoop jobs to oozie to ease administration and manual runs - https://phabricator.wikimedia.org/T222378 (10Milimetric)
[16:41:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (10Ottomata) p:05Triage→03High
[16:42:11] 10Analytics: Update mediawiki_history with username bein an IP to better define isAnonymous - https://phabricator.wikimedia.org/T222147 (10Ottomata) p:05Triage→03High
[16:42:25] 10Analytics: Update mediawiki_history with username bein an IP to better define isAnonymous - https://phabricator.wikimedia.org/T222147 (10Ottomata) 05Open→03Declined
[16:42:47] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (10JAllemandou)
[16:42:49] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Mediawiki History Release - 2019-04 snapshot - https://phabricator.wikimedia.org/T221824 (10JAllemandou)
[16:43:01] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (10JAllemandou)
[16:43:06] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Mediawiki History Release - 2019-04 snapshot - https://phabricator.wikimedia.org/T221824 (10JAllemandou)
[16:43:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Mediawiki-History fixes before deploy - https://phabricator.wikimedia.org/T222141 (10JAllemandou)
[16:43:15] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Mediawiki History Release - 2019-04 snapshot - https://phabricator.wikimedia.org/T221824 (10JAllemandou)
[16:43:17] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata) p:05Triage→03Normal
[16:43:22] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata) p:05Normal→03Triage
[16:43:46] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata) a:03Ottomata
[16:48:12] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata)
[16:48:14] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10Ottomata)
[16:48:55] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10Ottomata) TODO: check if we can use the debian python-arrow package instead of building a version into our custom spark2 package.
[16:51:02] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) p:05Triage→03High
[16:51:18] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis, and 3 others: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 (10Ottomata) p:05Triage→03High
[16:51:31] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) p:05High→03Triage
[16:51:35] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis, and 3 others: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 (10Ottomata) p:05High→03Triage
[16:52:00] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis, and 3 others: Port usage of mediawiki_CirrusSearchRequestSet to mediawiki_cirrussearch_request - https://phabricator.wikimedia.org/T222268 (10Ottomata) p:05Triage→03High
[16:52:08] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) p:05Triage→03High
[16:54:01] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata)
[16:54:03] 10Analytics: Upgrade pandas in spark SWAP notebooks - https://phabricator.wikimedia.org/T222301 (10Ottomata)
[16:55:17] 10Analytics: Use historical fields in history reduced dataset - https://phabricator.wikimedia.org/T222278 (10Ottomata) p:05Triage→03Normal
[16:55:49] 10Analytics: Upgrade pandas in spark SWAP notebooks - https://phabricator.wikimedia.org/T222301 (10Ottomata) p:05Triage→03Normal
[17:07:46] 10Analytics, 10WMDE-Analytics-Engineering: Install Gensim for Python3 on stat1007 - https://phabricator.wikimedia.org/T222359 (10Ottomata) Hiya! Can you give us a little more context so we can prioritize this? And/or, can you do this on a jupyter notebook SWAP host? You might be able to pip install there. h...
[17:10:08] 10Analytics: Move the three sqoop jobs to oozie to ease administration and manual runs - https://phabricator.wikimedia.org/T222378 (10Ottomata) p:05Triage→03Normal
[17:11:02] 10Analytics, 10Analytics-Cluster: Upgrade Spark to 2.4.2 - https://phabricator.wikimedia.org/T222253 (10Ottomata) p:05Triage→03Normal
[17:11:04] 10Analytics, 10Analytics-Cluster: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10Ottomata) p:05Triage→03Normal
[17:11:10] 10Analytics, 10Analytics-Kanban: Move the three sqoop jobs to oozie to ease administration and manual runs - https://phabricator.wikimedia.org/T222378 (10Ottomata)
[17:25:27] elukey: https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave-2 ?
[17:25:41] joal: already in!
[17:31:09] joal: looks good!
[17:31:15] let's see if it explodes or not :P
[17:31:20] (brb)
[17:31:25] \o/
[17:37:40] they work!
[17:37:49] 10Analytics, 10Discovery, 10Operations, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10CDanis) I got tied up with goal work and incident response and have only had a little time to spend on this. The client that @Ottomata found does look like a good on...
[17:38:28] ahahha joal almost!
[17:38:29] elukey: interesting - current ownership of the new files is analytics:hdfs, it was hdfs:hadoop
[17:38:32] :)
[17:38:38] yeah I was about to say :)
[17:38:48] Could we change the default group of analytics on HDFS?
[17:39:03] I didn't even know it was a thing
[17:39:06] elukey: Shall I kill that job, and we restart once updated?
[17:39:57] nah I don't think it is needed.. is it a setting to add to the oozie job?
[17:40:12] elukey: I don't think so
[17:40:34] yeah I agree
[17:40:39] so we can leave it running as it is
[17:41:04] elukey: ok
[17:41:10] :)
[17:42:16] In theory the group should be the one from the parent dir
[17:42:50] yeah, checked the
[17:42:57] parents - not the same!
[17:42:57] elukey@stat1004:~$ id analytics
[17:42:58] uid=498(analytics) gid=498(analytics) groups=498(analytics),731(analytics-privatedata-users)
[17:43:00] elukey@stat1004:~$ id hdfs
[17:43:01] this is why :)
[17:43:03] uid=117(hdfs) gid=123(hdfs) groups=123(hdfs),120(hadoop)
[17:43:17] no wait
[17:43:31] mmmmm
[17:43:47] I thought it was because 'analytics' is not in 'hadoop'
[17:43:50] that would make sense
[17:43:57] but it is not in 'hdfs' either
[17:44:04] 10Analytics, 10Analytics-Kanban, 10Anti-Harassment, 10Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (10nettrom_WMF) @JAllemandou : Thanks for your patience while this has been stuck in my backlog! I think this looks great as a first step and w...
[17:44:54] ufff
[17:45:45] :(
[17:46:04] Maybe it uses 'hdfs' by default if it doesn't know which one to use?
[17:46:48] in theory from what's written in http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html it should be the perms of the parent dir
[17:56:28] nuria: our analytics budget does not include hadoop expansion but it seems we are going to need that if we want to enable content parsing cases
[17:56:30] cc joal
[17:58:15] nuria: in theory we are replacing all the nodes that have fewer cores/less ram with new, more powerful ones, so a little "expansion" will happen anyway on this front
[17:58:26] nuria: I agree we need an expansion if we want content parsing to happen
[17:58:27] IIRC during the budget meeting we didn't think we needed more nodes
[17:58:42] elukey: I think we must have more nodes for any content parsing use cases
[17:58:48] nuria: In my mind though, content parsing for next year will still be beta
[17:59:07] nuria: we can do content parsing now, but not on a regular basis, not too complicated
[17:59:09] but this is a new use case right? I am not getting what content parsing is
[17:59:27] elukey: process the whole text ever entered in the wikis :)
[17:59:49] elukey: the text is already dumped onto hadoop - but we don't use it as of now
[17:59:52] elukey, joal, ottomata: no, content parsing has been in our plans for next year for a bit
[18:00:18] well more nodes then :D
[18:00:22] :)
[18:00:44] nuria: My last understanding of the global goals was that there was no concrete goal including it
[18:00:56] might have completely misunderstood though
[18:01:09] joal: I just created a directory under /user/analytics, chowned it to analytics:hadoop and used touchz to create a file.. perms work correctly. I suspect it is an oozie thing?
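elukey's quick test above, spelled out as a sketch (the /tmp path is hypothetical; per the HDFS permissions guide linked above, a new file inherits the group of its parent directory rather than the creating user's primary group):

    # make a dir whose group is 'hadoop', then create a file in it as analytics
    sudo -u hdfs hdfs dfs -mkdir /tmp/perms-test
    sudo -u hdfs hdfs dfs -chown analytics:hadoop /tmp/perms-test
    sudo -u analytics hdfs dfs -touchz /tmp/perms-test/f
    # expect the new file's group to be 'hadoop', inherited from the parent dir
    hdfs dfs -ls /tmp/perms-test

That the aqs files came out as analytics:hdfs instead is what points the suspicion at oozie rather than at plain HDFS semantics.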
[18:01:45] joal: "* ingestion/processing of content" and ML related stuff has been in our very preliminary goals for a while, yes
[18:01:54] nuria: i don't think content is going to affect storage constraints
[18:02:07] and we will be doing a cpu/ram expansion as joseph says
[18:02:09] ottomata: it will affect processing constraints
[18:02:14] just because we are replacing nodes
[18:02:28] and i don't think we are really full atm with processing usage
[18:02:36] ottomata: if we want to build models that look at content to extract features
[18:03:08] nuria: i think that all that stuff will take long enough that we'll have a better idea of what we need the following FY
[18:03:15] and won't really need it between now and then
[18:03:40] (afk making lunch...)
[18:03:46] ottomata, elukey : i think as plans for other projects materialize some projects become more concrete
[18:04:29] ottomata, elukey : i disagree, it is very likely that we need to support a feature store and while we might not be the main ones building it it seems quite possible that feature building will happen in hadoop
[18:04:51] ottomata, elukey the most costly features we have now come (probably) from processing of content
[18:05:20] ottomata, elukey : will set up a meeting next week to talk about this cc joal
[18:05:29] ack nuria
[18:05:59] ack!
[18:06:38] joal: if you agree, I'd let the aqs job run as it is since group perms shouldn't be a problem, and then figure out tomorrow morning what is the obscure reason for this
[18:06:56] +1 elukey
[18:07:11] super :)
[18:07:19] going afk then! o/
[18:12:45] Gone for dinner as well - Will come back to monitor the spark job after
[18:13:27] ottomata: also we would need to add computing power for model building (as well as the ability to build models in hadoop easily so we can move that step out of the stats machines)
[18:14:16] aye, i just think we are actually under-utilized right now.
[18:14:31] would be cool to have some historical usage stats...(do we?)
[18:16:58] ottomata: i think we do not, but I think if we are successful at helping people move from the stats machines to hadoop, usage will increase dramatically in terms of processing
[18:19:38] speaking as a user I don't believe many of the modeling tools I use will work in hadoop
[18:20:28] they just tend to assume they have all the cores and memory they need on one machine and don't distribute easily.
[18:20:43] I'm thinking about stan, lme4, sklearn.
[18:21:13] I've played around with spark.ml and mahout, but they are very far behind when it comes to supporting bespoke modeling.
[18:23:36] it's kind of a drag since it means the workflow is: build a dataset using spark, take a sample into pandas, and then model using the sample
[18:24:01] modeling on the entire dataset would be sooo cool! but I don't think the technology is really there yet.
[18:24:11] I think it's on the roadmap for stan
[18:24:36] but I expect it will take years
[18:54:03] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Tgr) Where does mediawiki_api_request come from? The EventBus extension?
[18:55:15] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) Yes and no. It comes via monolog, just like ApiAction in ApiMain.php. The EventBus extension now has a Monolog handler that is used for the api-re...
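The spark-then-pandas workflow groceryheist describes above looks roughly like this from a pyspark2 shell - a sketch only, where the table, columns, and sampling fraction are hypothetical:

    pyspark2 --master yarn
    >>> df = spark.sql("SELECT some_feature, some_label FROM some_db.some_table")
    >>> # take a sample small enough to fit in memory on one machine...
    >>> pdf = df.sample(fraction=0.01, seed=42).toPandas()
    >>> # ...then model pdf locally with single-machine tools (sklearn, statsmodels, ...)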
[18:55:30] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) See the parent task(s) :)
[19:17:02] PROBLEM - Check the last execution of camus-mediawiki-analytics on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki-analytics
[19:20:08] ^ is me manually running that job, so it can't launch
[19:20:11] weird that it pages for that...
[19:20:14] alerts*
[19:22:32] ottomata: I managed to find some view of cluster usage if you're interested :)
[19:25:11] oh ya!
[19:25:22] ottomata: cave!
[19:36:11] ottomata: want to be my buddy as I delete this data?
[19:36:24] ya milimetric gimme 5
[19:36:32] np, I'll be in the cave
[19:52:06] 10Analytics, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10EvanProdromou) This is such an interesting ticket cluster. Subscribed!
[19:53:36] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Milimetric) Ok, done, and now I'm seeing `"impressionEventSampleRate":0.01` in the data, so all is good going forward. Tha...
[19:54:19] 10Analytics, 10Analytics-Kanban: [Bug] Type mismatch for a few other schemas - https://phabricator.wikimedia.org/T216771 (10Milimetric)
[19:54:23] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Milimetric) 05Open→03Resolved
[19:54:40] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Ottomata) I modified the mediawiki-analytics (avro) camus job to set `kafka.move.to.earliest.offset=true` and ran it a few times. It seems to have fixed it...
[20:10:22] 10Analytics, 10Datasets-Archiving, 10Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (10ArielGlenn) Out of curiosity, what's your use case for them, @EvanProdromou ? Not that I can get back to them any time soon :-(
[20:46:27] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Tgr) Thanks, so this is from the `api-request` MediaWiki log channel created in [[https://gerrit.wikimedia.org/r/c/mediawiki/core/+/491887|491887]]. I missed...
[21:15:04] looks like /mnt/hdfs on stat1007 needs a remount
[21:20:48] RECOVERY - Check the last execution of camus-mediawiki-analytics on an-coord1001 is OK: OK: Status of the systemd unit camus-mediawiki-analytics
[22:31:15] 10Analytics, 10Analytics-Kanban, 10EventBus: Port usage of mediawiki_ApiAction to mediawiki_api_request - https://phabricator.wikimedia.org/T222267 (10Nuria) @Tgr: the code was merged a while back but it is only as of (last week?) that this is deployed to all wikis.
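For reference, the Camus fix ottomata logs in the task above ([19:54:40]) is a single property in the job's .properties file - shown here as a sketch, with the surrounding file layout assumed:

    # make Camus jump to the earliest available Kafka offset instead of
    # aborting when its stored offset is out of range
    kafka.move.to.earliest.offset=true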