[02:19:33] Analytics, Pageviews-API: Provide weekly top pageviews stats - https://phabricator.wikimedia.org/T133575#2546680 (MusikAnimal) @Nuria any chance we could triage this or give an estimate as to feasibility/likelihood of actually happening? //The Signpost// is now relying on [[ https://tools.wmflabs.org/top...
[07:52:02] joal: o/ I have no idea about the latency :P
[07:52:13] Hi elukey !
[07:52:26] I would wait some days to see if it stays the same.. I observed bumps like these in the past
[07:52:46] but I'll try to check the JVM differences
[08:09:29] elukey: I wonder if it wouldn't come from a non-restart after the user change?
[08:13:27] elukey: Do we spend some time now on loading?
[08:15:04] joal: I wondered about the non-restart after the user change but I don't see any good reason...
[08:15:18] joal: can you give me ~30 mins? I need to finish some things
[08:15:20] elukey: I don't really know
[08:15:24] elukey: sure
[08:15:43] elukey: "I don't really know" relates to the restart, not the 30 mins ;)
[08:15:53] take your time
[08:22:25] sure :)
[08:41:41] joal: I checked the cassandra logs right after the restart and I didn't see anything shiny that explains what happened, plus the diff between old/new jdk seems to be security related
[08:41:57] hm ... weird
[08:41:57] but the time of the latency drop matches my restarts perfectly
[08:42:01] I know :)
[08:44:22] it might be something related to the auth cache, not sure, let's see how it goes during the next days.. even p99 looks really good
[08:45:15] elukey: I think it's related to the new user more than the cache
[08:45:39] like the new user kicked in at that moment, not when you merged
[08:47:54] Analytics-EventLogging, Schema-change: Add index on event_type to MultimediaViewerDuration tables.
- https://phabricator.wikimedia.org/T70397#2547083 (jcrespo)
[08:48:55] joal: no I don't think so, since restbase/aqs started to use it straight away, and the user was already there because I used it manually
[08:49:23] elukey: hmmm
[08:50:58] I said auth cache because now it might be used more for the aqs user, not sure how evolved it is.. we might assume that it is a standard LRU cache but it might be something more rudimentary :D
[08:55:18] elukey: I can't imagine it's not round-trip related or something like that
[08:55:21] elukey: weirdo
[08:58:00] joal: this time is weird but good :D
[08:58:08] elukey: do you mind proof-reading my prose ?
[08:59:11] sure, have you already sent something to me?
[08:59:35] elukey: https://etherpad.wikimedia.org/p/backfilling_aqs
[08:59:51] WOW!
[09:03:00] 10+
[09:03:10] precise and full of examples
[09:03:51] elukey: the new openjdk releases also include conservative bugfixes along with the security fixes
[09:04:43] moritzm: I tried to review them but didn't find anything super specific.. could they be responsible for the latency drop that we have observed?
[09:06:15] joal: just added two lines to separate background reading from the procedure
[09:06:58] joal: the other thing that you could add is the months that need to be loaded
[09:07:00] in a table
[09:07:09] so people can tick the ones done
[09:09:15] I don't know, but I somewhat doubt that. The Java test suite is really big and they're very conservative about not breaking things, all user-visible changes are usually guarded by config options which need to be explicitly enabled
[09:09:24] but maybe let's downgrade one system to validate?
[09:13:55] moritzm: we did a user change the day before (namely forcing restbase to use a "regular" user to contact cassandra rather than the 'cassandra' admin one), and I suspect it might be related to the auth cache being refreshed..
Either way, the latency dropped heavily so we can leave the system as it is
[09:14:04] :)
[09:14:27] ok :-)
[09:16:59] joal: I'd need to restart the druid java daemons and cassandra on aqs100[456] after the compactions
[09:17:26] but I am not sure if you are doing something with druid these days
[09:17:50] * elukey suspects that Joseph forks himself on demand
[09:19:06] elukey: good call on adding the month list :)
[09:19:31] elukey: no druid activity currently, you can go ahead :)
[09:20:34] * elukey observes that Joseph didn't mention anything about *not* being able to fork
[09:20:51] elukey: you learn many things when having a child :)
[09:20:57] ahahahahah
[09:20:59] forking is kinda part of the process
[09:21:03] :D
[09:21:06] well played, +1
[09:23:01] elukey: I'll send the link to the etherpad on the internal list and update the phab ticket as well
[09:24:00] super joal, thanks a lot for this work
[09:24:24] elukey: np, mostly needed to hand off securely while on vacation !
[09:25:52] yep!
[09:42:43] Druid cluster restarted
[09:42:58] all jvm daemons up and running
[09:44:49] awesome elukey :)
[09:45:02] a-team, taking a break, will be back later
[09:49:00] mobrovac: aloha! I'd need to restart the zookeeper cluster for jvm upgrades
[09:59:26] let me know if you are ok with it :)
[11:05:41] hi team!
[11:06:51] Hi mforns :)
[11:06:54] hellooo
[11:15:52] elukey: :(
[11:16:05] * mobrovac doesn't like zk restarts
[11:16:52] I don't either but we'd need to upgrade the openjdk :)
[11:17:08] one host at a time shouldn't be that heavy
[11:20:06] elukey: any chance we can postpone this for early next week?
[11:20:10] it's friday after all
[11:20:20] and the last restart attempt didn't go that well
[11:22:50] mobrovac: sure, but if you remember correctly the last one was a puppet change, not a regular restart
[11:24:00] yup, i know
[11:24:15] but, enough stuff went wrong that i fear zk restarts on a friday
[11:24:16] :P
[11:24:40] it's too nice a day here in lisbon to be stuck with zk
[11:27:19] ahahaha okok
[11:27:21] got it
[11:28:15] mobrovac: enjoy some vinho verde for me ;)
[11:28:41] will do joal!
[11:32:35] mobrovac: I am going to be in Lisbon next week (and then I'll travel to Porto)
[11:32:50] oh really?
[11:32:58] we have to meet then!
[11:33:06] sure!
[11:33:14] I'll be there on the 17th
[11:33:53] are you going to stay here for some time?
[11:35:23] only a week in total, between Lisbon and Porto
[11:44:44] joal: I think I asked you this before, sorry, how do I make an oozie coordinator without inputs or datasets?
[11:44:51] cool elukey
[11:44:52] these python things I'm making are just cron jobs
[11:44:54] vacations?
[11:45:16] milimetric: Just remove the input related tags :)
[11:45:24] ah! :) k
[11:46:02] milimetric: I think there should however be a check on the existence of the data needed in the workflow (maybe)
[11:46:12] milimetric: seems reasonable?
[11:46:59] joal well in the site matrix case, I guess we could check the internet connection?
[11:47:12] and in the other case I guess we could check the db connection?
[11:47:55] mobrovac: yep yep vacations!
[11:48:02] milimetric: I'm actually very wrong: There's no data dependency on the datasets you generate :)
[11:48:16] milimetric: dependencies are for the scala jobs
[11:48:30] right, ok
[11:48:45] I wouldn't say that's *very* wrong
[11:48:46] milimetric: The thing is though, since we need those datasets as inputs, it would be great to define them as datasets
[11:49:12] milimetric: dependency graphs are in my head currently (subgraphs, maven, oozie...)
[11:49:15] :D
[11:49:17] yes, that I can do, so I'd define them as output-events, right?
[11:49:25] milimetric: exactly !
[11:49:27] k
[11:49:37] milimetric: you need to write a dataset.xml file :(
[11:49:45] milimetric: I can help / do it if you wish
[11:49:55] oh it's ok, I think I did that before
[11:50:04] I saw the examples and I copy-pasted one already
[11:50:09] just gotta fumble through it :)
[11:50:11] ok sounds good :)
[11:50:17] That's great :)
[12:00:19] milimetric: question: in pageDataExtractors, shouldn't we fail if old or new titles are empty (null fails already)
[12:00:22] ?
[12:15:56] joal: so even 50X went away completely in aqs
[12:16:11] ajajajaj
[12:16:13] elukey: so far, so good :)
[12:16:24] the former should have been a "hahahaah"
[12:16:31] meaning that I have no idea
[12:16:37] really good then
[12:19:03] cool
[12:19:21] a-team, need to be AFK for a while, will be back soon
[12:19:30] joal, ok, cya
[12:26:59] one thing that I noticed in AQS is https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system?panelId=7&fullscreen
[12:27:12] so disk throughput on aqs100[123] went down
[12:27:24] that might mean fewer reads from disk
[12:30:56] so the good news is that at the moment AQS is not throwing 50X anymore
[12:31:17] at least, we haven't been sending them for the past 24 hours
[13:01:16] joal: I'm not sure we should fail, probably just discard the event?
[13:03:00] hm, elukey don't we have python on the hadoop nodes?
[13:03:06] I get /usr/bin/env: python : No such file or directory
[13:05:09] elukey@analytics1045:~$ which python
[13:05:09] /usr/bin/python
[13:05:43] and /usr/bin/env python works
[13:05:49] but maybe the PATH is different?
[13:06:26] milimetric: is it a script run by a specific user, you, etc..?
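For context on the dataset.xml discussion above: an oozie dataset definition is a small XML fragment that a coordinator can then reference in its output-events. This is an illustrative sketch only; the dataset name, frequency, dates, and paths are assumptions, not the actual refinery definitions.

```xml
<!-- Illustrative dataset.xml sketch; the name, frequency, and
     uri-template below are assumptions, not the real refinery config. -->
<datasets>
  <dataset name="mw_site_matrix"
           frequency="${coord:days(1)}"
           initial-instance="2016-08-01T00:00Z"
           timezone="Universal">
    <uri-template>${data_directory}/site_matrix/${YEAR}-${MONTH}-${DAY}</uri-template>
    <done-flag>_SUCCESS</done-flag>
  </dataset>
</datasets>
```

A coordinator would then declare this dataset in its `<output-events>` via a `<data-out>` element, which is the "define them as output-events" step mentioned in the exchange above.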
[13:13:18] elukey: oozie
[13:13:28] https://hue.wikimedia.org/jobbrowser/jobs/job_1468526822215_87125/single_logs
[13:13:40] std_err there had a problem finding python
[13:15:04] hm, someone says it might be due to a line ending problem...
[13:16:18] but searching for \r doesn't yield anything
[13:18:37] milimetric: I am super ignorant but when you create a map/reduce job there should be the possibility to pass env variables and such, right?
[13:19:53] uh... I have a very shallow understanding of this. I'm not really sure how oozie dispatches its work to map reduce
[13:20:10] it uses Yarn
[13:20:28] or at least this is my understanding
[13:20:43] that in turn handles the whole thing, creating execution containers etc.
[13:23:10] milimetric: is it an oozie shell action? Or something different? Because the major point is that whatever executor runs on the Hadoop node needs to know where to look for python
[13:23:30] elukey: yeah, oozie shell
[13:23:54] https://gerrit.wikimedia.org/r/#/c/303339/5/oozie/mediawiki/refresh_site_matrix/load-site-matrix.py
[13:27:59] hiiii joal can do refinery stuff whenever
[13:29:47] sorry elukey that link's to the actual python script, this is the oozie one: https://gerrit.wikimedia.org/r/#/c/303339/5/oozie/mediawiki/refresh_site_matrix/workflow.xml
[13:30:11] (I removed the empty elements and now I get that python error, I'm googling around to see how others do it)
[13:33:00] milimetric: so one quick way to check if this is the problem is to replace /usr/bin/env python with /usr/bin/python
[13:33:13] k
[13:35:27] hm, different file not found now: https://hue.wikimedia.org/jobbrowser/jobs/job_1468526822215_87309/single_logs
[13:40:19] milimetric: why not specify /usr/bin/python as the
[13:40:22] and the script as an arg?
[13:42:21] I read the docs on this but it still doesn't make sense
[13:42:28] what's load-site-matrix.py#load-site-matrix.py ?
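A hypothetical repro of the line-ending theory discussed above: a script saved with Windows line endings makes the kernel hand `python\r` to `/usr/bin/env`, which then reports `/usr/bin/env: python\r: No such file or directory` (the carriage return is often invisible in the error). Grepping the file for a literal `\r` can miss it depending on the tool, so checking the raw bytes of the shebang line is more reliable. The helper name here is made up for illustration.

```python
# Check whether a script's shebang line has a trailing carriage return,
# which breaks "#!/usr/bin/env python" with a confusing "No such file or
# directory" error at launch time.
import os
import tempfile

def shebang_has_cr(path):
    """True if the first line of the script ends in CRLF."""
    with open(path, 'rb') as f:
        return f.readline().endswith(b'\r\n')

with tempfile.TemporaryDirectory() as d:
    unix_script = os.path.join(d, 'unix.py')
    dos_script = os.path.join(d, 'dos.py')
    with open(unix_script, 'wb') as f:
        f.write(b'#!/usr/bin/env python\nprint("ok")\n')
    with open(dos_script, 'wb') as f:
        # Same script, but saved with Windows (CRLF) line endings:
        f.write(b'#!/usr/bin/env python\r\nprint("ok")\r\n')
    print(shebang_has_cr(unix_script))  # False
    print(shebang_has_cr(dos_script))   # True
```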
[13:43:16] i'm pretty sure that is saying to put load-site-matrix.py into hdfs named as load-site-matrix.py
[13:43:35] so that should still be there with /usr/bin/python as the exec, right?
[13:44:01] https://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a3.2.2.1_Adding_Files_and_Archives_for_the_Job
[13:44:03] i guess its a symlink
[13:44:06] yes
[13:44:10] you need the file you want to execute
[13:45:56] hey ottomata :)
[13:46:11] hiiii
[13:46:26] ottomata: o/
[13:46:40] do you have 10 minutes for a quick hangout today?
[13:46:47] maybe let's say 20
[13:46:47] yes!
[13:46:50] now is good
[13:47:02] nice! thanks! batcave?
[13:47:27] Shall we deploy that thing ottomata ?
[13:47:31] yup
[13:47:37] oh man elukey should I help joal first?
[13:47:41] it'll take a few mins
[13:48:23] joal: let's do it, elukey let's hang out in a bit after we are done
[13:48:29] ottomata: I think the synchro should happen for the machine running camus (the rest should be fine)
[13:48:40] sure!
[13:48:41] ja
[13:49:02] I'll let you stop puppet, stop the camus cron, we wait for camus to finish, then proceed ?
[13:49:08] was about to say that too!
[13:49:27] milimetric: Just read your comment: makes a lot of sense :) I'll also add a counter for discarded events
[13:49:43] sounds good ottomata :)
[13:49:51] joal: thanks, cool
[13:50:58] joal: i don't see any running camus
[13:51:00] and puppet and cron are stopped
[13:51:03] so go ahead with deploy
[13:51:19] ottomata: Doing !
[13:51:30] !log Deploy refinery from tin
[13:51:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[13:51:35] eqi deployment
[13:51:37] oops :)
[13:52:09] elukey: it looks like we don't have docopt installed, should I change it to argparse or do you all want docopt on the nodes?
[13:53:46] milimetric: if it is not a huge change I'd go for argparse, which is a bit more portable
[13:54:03] * joal likes docopt, but went with argparse on the cluster because of the same issue :(
[13:54:23] joal: yeah, i looked for examples and all I found were the docopt scripts in /bin
[13:54:27] If a-team agrees, I'd rather install docopt :)
[13:54:30] *refinery/bin
[13:54:34] right
[13:54:41] I think we should either
[13:54:42] 1. install docopt
[13:54:49] 2. refactor the bin scripts to argparse
[13:54:52] hm, we could easily install docopt everywhere
[13:54:57] ottomata: Refinery deployed on stat100[24] and analytics1027
[13:54:57] I don't mind either way
[13:54:58] i'm surprised its not
[13:55:06] ok joal
[13:55:12] milimetric: its easy, can be done with puppet in a sec
[13:55:14] will do shortly
[13:55:30] awesome, thanks for docopt ottomata !
[13:55:38] ottomata / elukey: if we do install docopt, can we get a later version than 0.6.1? there's a weird bug I ran into where you can't specify multi-line usage unless you start each line with an optional param
[13:55:39] both are ok with me
[13:55:46] joal: will see what i can do
[13:55:48] sorry
[13:55:48] ottomata: I think you can go ahead and merge/deploy the camus puppet thing
[13:55:50] milimetric: i mean
[13:55:50] (because if the line starts with a dash, it's treated as an option)
[13:55:50] ok
[13:55:52] joal:
[13:56:16] joal: we don't need to restart any running oozie jobs, right?
[13:56:26] correct ottomata
[13:56:28] the next time we restart them they'll pick up the new unshaded core .jar
[13:56:35] and if there are problems we deal with them then?
[13:56:42] ottomata: actually, they won't pick it up ;)
[13:56:49] unless we change the refinery version?
[13:56:56] refinery_jar_version
[13:56:57] ?
[13:57:10] ottomata: correct, plus, only refinery-core is unshaded
[13:57:15] no job uses refinery-core
[13:57:38] this is why it works easily
[13:57:41] aye right, riiiight, the other jars are shaded to include it
[13:57:43] aye
[13:57:43] aye
[13:57:45] cool
[13:58:25] joal: puppet has run on an27 with updated cron
[13:58:35] watching camus logs waiting for next run
[13:58:41] ottomata: great :)
[13:59:03] ottomata: no merge message on ops channel ... Have you merged the camus thing?
[13:59:07] ottomata: going to send the code review for docopt
[13:59:39] '
[14:00:13] mmmm do we also need python3?
[14:00:17] it seems installed
[14:00:34] joal: ja
[14:00:52] https://gerrit.wikimedia.org/r/#/c/304195/
[14:01:15] you guys running python3 or 2?
[14:02:24] ottomata: I believe 2, but no reason not to also install the python3 version, no?
[14:02:25] milimetric: the latest docopt is 0.6.2
[14:02:42] sweet, thx
[14:03:06] that has your fix?
[14:03:20] milimetric: i have to build a new .deb for it if we want it, trusty has 0.6.1
[14:03:29] oh
[14:03:31] https://gerrit.wikimedia.org/r/#/c/304472/1/modules/role/manifests/analytics_cluster/hadoop/worker.pp ?
[14:03:36] nice elukey :)
[14:03:38] uh... ottomata 0.6.1 is fine if it's easier
[14:03:52] I worked around the bug anyway, so long as everyone else is ok working around it
[14:03:57] what it means is that this works:
[14:03:59] Usage:
[14:04:30] blah.py --something ARG --else ARG
[14:04:30] [--optional ARG] --another ARG
[14:04:34] but this doesn't:
[14:04:53] Usage:
[14:04:53] blah.py --something ARG --else ARG
[14:04:53] --another ARG
[14:05:20] so you kind of need to invent as many optional args as you have lines :)
[14:05:25] ah interesting, haha
[14:05:45] milimetric: i usually just do
[14:05:49] Usage: camus [options]
[14:05:52] and then list options in
[14:05:53] Options:
[14:05:55] ...
[14:06:10] ottomata: I like it like that
[14:06:11] right, properties file is cool
[14:06:22] ottomata: camus partition checker working !
[14:06:26] huh, eh?
[14:06:27] Thanks for the deploy :)
[14:06:30] the other way is if you want the script itself to enforce usage
[14:06:33] milimetric: no i mean
[14:06:35] ottomata: the python thing :)
[14:06:54] https://github.com/wikimedia/analytics-refinery/blob/master/bin/camus#L18-L31
[14:07:08] i mean i don't specify the usage all on one line
[14:07:15] docopt still enforces it
[14:07:43] ottomata: docopt won't make some options required and some optional though
[14:07:54] if you spell each one of them out, you get to control what usages are legal
[14:07:55] ohhh, i guess its a little weird to have 'required options' :p
[14:08:11] not unusual at all
[14:08:13] lots of scripts do it
[14:08:32] yeah, it's just standard linuxy script behavior
[14:08:51] yeah, but required args are also pretty standard, i guess it depends on how many you got
[14:08:53] you don't do things like
[14:09:00] cp --source-file f1 --dest-file f2
[14:09:00] you do
[14:09:02] cp f1 f2
[14:09:05] !log Deploy refinery on hadoop
[14:09:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[14:09:24] right, and docopt lets you specify those as well, so you can make them required that way
[14:09:33] and leave all the options as --option1, etc.
[14:09:52] then you can do Usage: some.py positional1 positional2 [options] and that's fine
[14:10:09] yeah, if you had only one or two required args, i would say do that, but if you have a lot, it could get really confusing without naming them as option flags
[14:10:25] yeah, in this one script I have a lot
[14:10:28] aye ok
[14:10:33] addshore: Hi !
[14:10:38] hey!
[14:10:45] addshore: Just deployed refinery, will start your job
[14:10:53] okay! can I give you a start date?
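The "required options" debate above can be sketched with argparse, which is what the cluster scripts fell back to. The script and option names here are illustrative, not the real sqoop wrapper's interface.

```python
# Sketch of the pattern discussed above: positional args are required by
# default (cp-style), and argparse also supports required *named* options,
# which is exactly what docopt's usage-line syntax makes awkward to spell
# across multiple lines.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog='sqoop-wrapper')
    # Positional argument, required by default, like `cp f1 f2`:
    p.add_argument('wiki_list')
    # Named options can be marked required explicitly:
    p.add_argument('--jdbc-host', required=True)
    p.add_argument('--output-dir', required=True)
    # Genuinely optional flag:
    p.add_argument('--verbose', action='store_true')
    return p

args = build_parser().parse_args(
    ['wikis.list', '--jdbc-host', 'db1047', '--output-dir', '/tmp/out'])
print(args.jdbc_host, args.output_dir, args.verbose)  # db1047 /tmp/out False
```

Omitting a required named option makes argparse print a usage message and exit with status 2, so the script enforces its own usage without inventing placeholder optional arguments per line.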
[14:10:54] ottomata: this is it before I moved it to the oozie folder: https://gerrit.wikimedia.org/r/#/c/303339/5/bin/sqoop-mediawiki-dbs
[14:10:59] (just so you have a concrete example)
[14:11:10] haha, nice workaround
[14:11:24] addshore: you can, but if it's more than 2 months ago, there's no data
[14:11:30] thx :)
[14:11:50] 28th of July please joal :)
[14:12:02] addshore: Will do !
[14:13:06] elukey: batcave now?
[14:13:56] ottomata: sure
[14:21:42] addshore: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0064836-160630131625562-oozie-oozi-C/
[14:26:54] joal: I'd like to create the aqsloader user in aqs100[456] with the same pass as the admin one. AFAIU the only change required would be in your .properties, right?
[14:27:04] I mean, for the backfilling
[14:27:58] I didn't realize that loading took so much time :/
[14:30:08] elukey: works for me
[14:31:59] elukey: now is a good time
[14:32:21] joal: yep I am preparing the script :)
[14:32:23] hmmmmm but this means we'll have to install sqoop and the mysql researcher password file everywhere
[14:32:25] elukey: almost finished compacting the 2nd month, I'll start a new loading job later on this evening, can be done with the new user
[14:32:28] I could use a quick brain bounce about that
[14:32:57] milimetric: batcave?
[14:33:02] omw
[14:40:21] !log created the 'aqsloader' user on aqs100[456] cassandra instances following https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/AQS_Tasks
[14:40:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[14:40:25] joal --^
[14:41:27] * joal rubs his hands :)
[14:42:36] joal: I'd also need to restart the cassandra instances for the JVM upgrades
[14:42:52] elukey: will let you know when ready
[14:43:30] super
[14:45:30] elukey: unfortunately I think compaction will end during the night (from what I see)
[14:53:01] joal: I can restart the jvms tomorrow morning easily
[14:55:31] elukey: can be that, or monday after another month (if finished) :(
[14:55:53] I'll be on vacation on Monday, so tomorrow might be better
[14:56:11] are you going to kick off another job during the weekend or on Monday?
[14:56:11] k elukey, I'm not happy though
[14:56:17] no?
[14:56:21] elukey: I'll do it tomorrow morning
[14:57:49] joal: why are you not happy??
[14:58:04] cause it makes you work tomorrow morning
[14:58:21] ahhhhh
[14:58:28] I thought there was something serious
[14:58:37] don't worry, I am happy to do it :)
[14:58:37] :)
[14:58:44] it will take me 10 minutes
[14:59:50] urandom: o/ another thing that you might want to see - https://grafana.wikimedia.org/dashboard/db/aqs-elukey
[15:00:21] I restarted cassandra on the 11th around 13:00 UTC, which is when the latency dropped heavily
[15:00:54] we are not sure if this is due to the new jvm or if it is somehow related to the user switch that we did the day before (cassandra -> aqs for restbase)
[15:01:00] a-team: will be 2 minutes late, important phone call
[15:01:02] (maybe because the auth cache was cleared)
[15:01:15] elukey: interesting
[15:03:09] 50x went away too
[15:03:19] so we are super happy but it is still a bit strange :)
[15:06:40] elukey: not sure about the timing here, it sounds like you might be saying that the drop happened *after* the user change was actually applied, is that the case?
[15:07:50] i'd be more inclined to believe it was that if only because we didn't see any change in latency after the jvm upgrade, but if the two did not coincide...
[15:13:55] urandom: from the https://wikitech.wikimedia.org/wiki/Server_Admin_Log I switched the aqs user (restarting aqs on each node) on the 9th and restarted cassandra (on each node) on the 11th for the jvm upgrade
[15:14:19] and on the 11th, after the restart, the magic happened
[15:15:23] I assumed that switching *restbase/aqs* to use the aqs user was enough
[15:17:27] elukey: yeah
[15:21:51] urandom: I thought that maybe the auth cache (not sure how evolved it is) might have been refreshed only after the restart, giving us the final performance improvement. But there are a lot of changes in the Debian changelog for the JVM upgrade, so that might be a more plausible explanation
[15:22:15] elukey: yay?
[15:22:17] :)
[15:22:44] ??? :D ??
[15:23:31] mysteries like this are always disconcerting, but i guess better that it dropped than spiked :)
[15:24:47] ah yes for sure!
[15:25:35] I just wanted to get your opinion, I am really happy :)
[15:36:22] * milimetric getting lunch
[15:54:42] ottomata: if you have a bit of time, could you install siege and ab on aqs1004/5/6?
[15:55:11] siege?!
[15:55:23] do you need ab on all of them?
[15:55:27] probably just one, right?
[15:55:48] nuria_: these are in the analytics cluster
[15:55:51] do you need it installed on those machines?
[15:55:56] you can probably use ab from stat1002 or something
[15:56:28] huh, i didn't know about siege
[15:56:31] ottomata: the tests hit localhost on each aqs machine, so no, they have to be run locally
[15:56:39] hm, ok
[15:56:50] ok nuria_ we can install on all 3, let's remember to uninstall though
[15:57:05] ottomata: yes, as part of putting the cluster in service
[15:58:00] done.
[15:58:05] a-team logging off! o/
[15:58:13] elukey: ciao ciao
[16:00:23] nuria_: urls are defined with localhost
[16:00:39] joal: both siege and ab are installed, want to give it a try?
[16:00:51] joal: if you prefer Monday that works too
[16:01:04] nuria_: monday will be better, need to log off soon
[16:01:09] joal: k
[16:08:20] lunchin
[16:39:05] back
[16:39:20] holy moly it's hot out.
It feels like 43
[16:39:24] (with the humidity)
[16:48:33] (PS6) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[16:57:13] logging off a-team, see you on monday
[16:57:20] have a nice weekend joal
[17:40:40] (PS7) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[17:58:03] (PS8) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[18:08:53] Analytics, Pageviews-API: Provide weekly top pageviews stats - https://phabricator.wikimedia.org/T133575#2548587 (Nuria) @MusikAnimal : I think we will not be working on this item this quarter or next as we are focusing on edit data. Note that while we have several sources of pageview data the analytics...
[18:10:55] ottomata: you around?
[18:11:12] wanted to see about sqoop, that password file, and ask another oozie question
[18:11:50] sqoop should be installed everywhere
[18:11:55] ja am around!
[18:12:31] ok, so oozie is throwing: "E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'delete'. One of '{"uri:oozie:shell-action:0.1":mkdir}' is expected."
[18:12:49] link to xml
[18:12:49] when I tried to pass ...
[18:12:49] ?
[18:12:56] yes, one sec
[18:13:16] (PS9) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[18:13:32] line 40 https://gerrit.wikimedia.org/r/#/c/303339/9/oozie/mediawiki/refresh_site_matrix/workflow.xml
[18:16:29] I'll try an action instead of a , maybe something's weird about that schema
[18:16:30] milimetric: not sure but here's a shot
[18:16:33] try
[18:16:47]
[18:16:49] instead of 0.1
[18:17:07] I tried 0.3 and got the same, but I saw examples that have inside with 0.1
[18:17:13] I'll try 0.2 as well
[18:17:17] ah ok
[18:17:19] and I'll try the fs action too, maybe it works there
[18:17:20] yeah
[18:17:26] ok if you tried other versions then i don't think its that
[18:17:30] i'd just use whichever is the latest (and we have)
[18:17:59] meanwhile, you wanna see how you prefer to symlink /etc/mysql/conf.d/research-client.cnf ?
[18:18:29] btw, how would I know what version we have?
[18:19:51] uhhh, good q,
[18:19:51] hm
[18:19:52] aptitude show oozie
[18:19:56] Version: 4.1.0+cdh5.5.2+233-1.cdh5.5.2.p0.10~trusty-cdh5.5.2
[18:20:04] changed version in doc link to 4.1.0
[18:20:04] https://oozie.apache.org/docs/4.1.0/DG_ShellActionExtension.html#AE.A_Appendix_A_Shell_XML-Schema
[18:20:14] they only have up to 0.2 listed
[18:22:22] hm looking at conf file stuff...
[18:23:08] hm, milimetric i think we might not have enough control over permissions to do the symlink put thing
[18:23:31] also the hdfs user (that we usually do the hdfs refinery deploy with) doesn't have access to that mysql.conf file
[18:23:35] because it is not in the right group
[18:23:44] might be better to puppetize putting it in hdfs
[18:23:47] which will be a little bit hacky...
[18:23:48] hm.
[18:23:53] thinking about it, brb
[18:34:49] sorry, contractors, reading up
[18:35:40] ok, lemme know if I can help.
If it's easier, all I need is just the password in a file by itself
[18:35:49] that's easier for me anyway, and it's in the private repo, right?
[18:35:53] hmm, that might be easier actually, hmmmm
[18:36:24] still will be hacky though :/
[18:36:32] you pull it into a template and put that to hdfs?
[18:36:40] the putting to hdfs is hacky, right?
[18:45:17] ottomata: ok, so it was ok with me doing a delete in the action
[18:45:29] weird but I don't care, moving on :)
[18:47:46] (PS10) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[18:47:55] ha, ok :/
[18:48:13] milimetric: yeah, puppetizing stuff in hdfs is hacky
[18:48:22] sorry, puppetizing putting stuff into hdfs
[18:48:24] but ja, did this:
[18:48:29] https://gerrit.wikimedia.org/r/#/c/304494/1/modules/role/manifests/analytics_cluster/mysql_password.pp
[18:49:25] I see, yeah, executing the hdfs dfs -put is not the prettiest thing
[18:49:42] but it's all contained in that single place, so it seems maintainable to me
[18:50:03] like if someone's looking for usage of that password variable they'll find this and won't need to trace spaghetti puppet
[18:50:15] hopefully
[18:50:21] i had considered making a new define in the cdh module
[18:50:24] something like i have for directories
[18:50:30] that would abstract creation of files in a puppet-like way
[18:50:38] so you could do content => ...
[18:50:39] buuuuut
[18:50:45] ok, thx very much, I'll set it to /user/hdfs/mysql-analytics-research-client-pw.txt by default and copy it into my own dir for testing
[18:50:46] don't want to go that far right now :/
[18:50:52] k cool
[18:51:03] yeah, if you want to clean it up make a task and we'll prioritize it
[18:58:52] (PS11) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[19:04:50] milimetric: done, it's at /user/hdfs/mysql-analytics-research-client-pw.txt now
[19:04:59] oh sweet
[19:05:03] thx!
[19:18:06] (PS12) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[19:23:11] ottomata: I'm a little confused about making files available to oozie and I can't find examples
[19:23:30] when I ran this workflow, it didn't find the file on line 100: https://gerrit.wikimedia.org/r/#/c/303339/12/oozie/mediawiki/import_history/workflow.xml
[19:23:33] and so I added the thing on line 108
[19:23:51] and it still gives me: FileNotFoundError: [Errno 2] No such file or directory: 'wiki_grouped_db_test.list'
[19:24:22] brb
[19:29:20] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2548768 (Nuria) I will try to get a larger dataset (hopefully one week) excluding any response that is not a 200. That should reduce data a bit and provide you with what seems a cleaner d...
[19:30:31] ok back
[19:41:08] Analytics-EventLogging, Schema-change: Add index on event_type to MultimediaViewerDuration tables. - https://phabricator.wikimedia.org/T70397#2548789 (Tgr) Open>declined The dashboard has been broken for a long time, MediaViewer is not actively developed and has in general gotten stranded on the...
[19:42:23] milimetric: sorry, with you in a few mins
[19:42:29] brain is busy on something...
[19:46:06] ok milimetric with you
[19:51:19] Hmmm, milimetric not so sure
[19:51:22] i have some things to try though
[19:51:25] maybe, try using lib/ dir
[19:51:26] http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a7_Workflow_Application_Deployment
[19:52:03] http://stackoverflow.com/questions/12720610/how-to-specify-multiple-jar-files-in-oozie
[19:52:24] hmm, dunno though
[19:52:35] it seems like oozie will make those lib files available automatically
[19:54:58] dunno though
[19:57:30] Hmmm!
[19:57:32] https://hue.wikimedia.org/oozie/list_oozie_workflow_action/0065120-160630131625562-oozie-oozi-W%40import_history/
[19:57:36] oh that is a later one?
[19:57:37] milimetric: ?
[19:57:41] variable [appPath] cannot be resolved
[19:58:14] do you have a hue link to a workflow that failed for you?
[20:00:12] oh hm, milimetric i also see that https://hue.wikimedia.org/jobbrowser/jobs/job_1468526822215_87822/tasks/task_1468526822215_87822_m_000000/attempts/attempt_1468526822215_87822_m_000000_0/logs
[20:00:26] that No such file or directory is coming from your python script
[20:00:38] maybe you can debug by printing out cwd and the contents of the current directory
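A minimal version of the debugging suggestion above for the failing oozie shell action: before opening anything, log where the launcher actually runs the script and which files were localized next to it. The list filename is the one from the failing job; the helper name is made up for illustration.

```python
# Preamble for an oozie shell action script: dump the launcher's working
# directory and its contents to stderr (which ends up in the yarn task
# logs), then check for the expected input file before using it.
import os
import sys

def dump_launcher_context(expected_file):
    """Print cwd and its contents; return whether expected_file is there."""
    print('cwd:', os.getcwd(), file=sys.stderr)
    print('contents:', sorted(os.listdir('.')), file=sys.stderr)
    return os.path.exists(expected_file)

# Name taken from the FileNotFoundError in the job above:
if not dump_launcher_context('wiki_grouped_db_test.list'):
    print('wiki_grouped_db_test.list not found in', os.getcwd(),
          file=sys.stderr)
```

This turns the bare `FileNotFoundError` into a log line that shows whether the `<file>` element localized the file at all, and under which name.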
[20:32:37] it didn't really though [20:32:55] oozie thinks it does, but that is some problem with the python script and oozie disagreeing [20:32:59] yeah, so I'd do like lib/myfile#myfile and then access it the same way from python [20:33:16] maybe, maybe if you put it in lib you wont' need the stuff [20:33:29] if so, then you could access it as lib/myfile in python [20:33:30] right [20:33:32] really not certaino though [20:33:33] I'll try [20:33:37] I'll let you know [20:33:39] k [20:36:37] ottomata: can you put docopt on stat1002 as well? [20:38:18] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2549276 (Ottomata) @Krinkle I've got a WIP version of Kasocki going over here: https://github.com/ottomata/kasocki It still needs work, tests, etc, but I feel good about letting people l... [20:38:32] wha its not there?! [20:38:51] its there milimetric [20:39:12] oh! [20:39:18] I'm in python3 on this script :( [20:39:22] oh [20:39:24] sorry [20:39:28] I needed concurrent.futures [20:39:33] hm k lemme see real quick [20:40:23] (PS2) Nuria: [WIP] Bug fixes on datepicker [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165) [20:44:17] (PS3) Nuria: [WIP] Bug fixes on datepicker [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165) [20:51:07] milimetric: installed via puppet. [20:51:14] thx! [20:51:15] milimetric: i'm signing off compy, not sure if i'll be back on for the eve [20:51:21] but i might be! [20:51:25] laters! good luck [20:52:43] thanks! [21:03:43] bearloga: is the instrumentation behind http://discovery.wmflabs.org/metrics/#survival documented somewhere? (e.g a link to the code on github would be great already, or to the schema page in case it's based on eventlogging...) [21:07:01] HaeB: yup! 
the code that writes out that dataset (https://datasets.wikimedia.org/aggregate-datasets/search/sample_page_visit_ld.tsv) is lines 59-86: https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/search/desktop.R#L59 and it uses this schema: https://meta.wikimedia.org/wiki/Schema:TestSearchSatisfaction2
[21:08:15] bearloga: awesome, thanks!
[21:09:50] HaeB: you're welcome! let me know if you have any additional questions. there's a lot to parse there and I could have done a better job documenting what the pieces do
[21:10:18] "median lethal dose", huh ;)
[21:13:31] survival analysis has its roots in drug trials and I was too lazy to come up with a new term for describing "the point at which half the sample 'dies'"
[21:13:57] understood ;)
[21:20:47] LD50?
[21:22:59] bye a-team! see you next week, have nice days!
[21:23:43] bearloga: ok, that code is indeed a bit complex, but i think i understand the basic logic. so i guess there is also a piece of javascript that generates the checkin events (poison doses)? where would one find that?
[21:23:44] nite!
[21:31:17] MaxSem: yup, LD50
[21:31:51] HaeB: i'll try to find it
[21:33:11] bearloga: thanks! only if it's not too much effort
[21:55:26] Analytics, Pageviews-API: Provide weekly top pageviews stats - https://phabricator.wikimedia.org/T133575#2236232 (Tbayer) @MusikAnimal The Signpost actually switched away from a weekly publication schedule recently. Is the traffic report going to stick to the weekly format for the foreseeable future? Al...
[22:11:26] HaeB: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/ext.wikimediaEvents.searchSatisfaction.js#L216
[22:12:12] HaeB: ^ that's Erik B's work
[22:13:09] bearloga: i see. thanks again!
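The LD50 analogy discussed above can be made concrete with a toy sketch: sessions emit periodic check-in events, a session "dies" at its last observed check-in, and the LD50 is the check-in time by which half of the sample has died. This is a hypothetical reimplementation of the idea, not the code behind the dashboard.

```python
# Toy "median lethal dose" for session survival: the median of the
# per-session last-seen check-in times.
def ld50(last_checkin_times):
    """Return the time by which half of the sessions have 'died'."""
    ordered = sorted(last_checkin_times)
    if not ordered:
        raise ValueError('no sessions observed')
    # Median element: half the sample 'dies' at or before this point.
    return ordered[len(ordered) // 2]

# Five toy sessions whose last check-ins (in seconds) were:
print(ld50([10, 20, 30, 40, 420]))  # 30
```

Note how the one long-lived session (420 s) does not drag the statistic up, which is why a median, rather than a mean, is the natural summary for heavily skewed dwell times.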